Montessori education's impact on academic and nonacademic outcomes: A systematic review

Abstract
Background: Montessori education is the oldest and most widely implemented alternative education in the world, yet its effectiveness has not been clearly established.
Objectives: The primary objective of this review was to examine the effectiveness of Montessori education in improving academic and nonacademic outcomes compared to traditional education. The secondary objectives were to determine the degree to which grade level, Montessori setting (public Montessori vs. private Montessori), random assignment, treatment duration, and length of follow-up measurements moderate the magnitude of Montessori effects.
Search Methods: We searched for relevant studies in 19 academic databases, in a variety of sources known to publish gray literature, in Montessori-related journals, and in the references of studies retrieved through these searches. Our search included studies published during or before February 2020. The initial search was performed in March 2014 with a follow-up search in February 2020.
Selection Criteria: We included articles that compared Montessori education to traditional education, contributed at least one effect size to an academic or nonacademic outcome, provided sufficient data to compute an effect size and its variance, and showed sufficient evidence of baseline equivalency (through random assignment or statistical adjustment) of Montessori and traditional education groups.
Data Collection and Analysis: To synthesize the data, we used a cluster-robust variance estimation procedure, which takes into account statistical dependencies in the data. Otherwise, we used standard methodological procedures as specified in the Campbell Collaboration reporting and conduct standards.
Main Results: Initial searches yielded 2012 articles, of which 173 were considered in detail to determine whether they met inclusion/exclusion criteria. Of these, 141 were excluded and 32 were included. These 32 studies yielded 204 effect sizes (113 academic and 91 nonacademic) across 132,249 data points. In the 32 studies that met minimum standards for inclusion, including evidence of baseline equivalence, there was evidence that Montessori education outperformed traditional education on a wide variety of academic and nonacademic outcomes. For academic outcomes, Hedges' g effect sizes, where positive values favor Montessori, ranged from 0.26 for general academic ability (with high quality evidence) to 0.06 for social studies. The quality of evidence for language (g = 0.17) and mathematics (g = 0.22) was also high. The effect size for a composite of all academic outcomes was 0.24. Science was the only academic outcome that was deemed to have low quality of evidence according to the GRADE approach. Effect sizes for nonacademic outcomes ranged from 0.41 for students' inner experience of school to 0.23 for social skills. Both of these outcomes were deemed as having low quality of evidence. Executive function (g = 0.36) and creativity (g = 0.26) had moderate quality of evidence. The effect size for a composite of all nonacademic outcomes was 0.33. Moderator analyses of the composite academic and nonacademic outcomes showed that Montessori education resulted in larger effect sizes for randomized studies compared to nonrandomized studies, for preschool and elementary settings compared to middle school or high school settings, and for private Montessori compared to public Montessori. Moderator analyses for treatment duration and duration from intervention to follow-up data collection were inconclusive. There was some evidence for a lack of small sample-size studies in favor of traditional education, which could be an indicator of publication bias. However, a sensitivity analysis indicated that the findings in favor of Montessori education were nonetheless robust.
Authors' Conclusions: Montessori education has a meaningful and positive impact on child outcomes, both academic and nonacademic, relative to outcomes seen when using traditional educational methods.

1 | PLAIN LANGUAGE SUMMARY
1.1 | Montessori education significantly impacts academic and nonacademic outcomes
Relative to traditional education, Montessori education has modest but meaningful positive effects on children's academic and nonacademic (executive function, creativity and social-emotional) outcomes. This is indicated by a meta-analysis of 32 studies in which it was possible to compare traditional business-as-usual education to Montessori education.

| What is this review about?
How best to educate children is an issue of enduring concern, and Montessori is the most common alternative to the conventional education system. Montessori includes a full system of lessons and hands-on materials for children from birth to 18 years, presented individually, and embedded in a philosophical framework regarding children's development and its optimal conditions. The term Montessori is not trademarked, and, therefore, its implementation can vary. We studied the range of variations included in the literature, which likely reflects the range of implementations encountered in the world. We also compared Montessori with a range of control conditions described in the literature as traditional (sometimes referred to as conventional, or business-as-usual), reflecting the implementation of traditional education in the real world.
| What are the main findings of the review?
Using only studies with evidence of baseline equivalence, this review found that Montessori education had a significant positive impact on academic and nonacademic outcomes. Effects were larger in studies with random assignment, with preschool or elementary school participants, and in private Montessori schools. The studies took place in 10 countries: the USA (18 studies), Turkey (four studies), Switzerland (three studies), and one each in England, France, Malaysia, Oman, Iran, the Philippines, and Thailand.

| How effective is Montessori education?
On academic outcomes, Montessori students performed about 1/4 of a standard deviation better than students in traditional education.
The magnitude of these effects could be considered small when compared to findings obtained in tightly-controlled laboratory studies, but they could be considered to be medium-large to large when compared to studies in real-world school contexts involving standardized tests.
Most (28) of the included studies were conducted in schools implementing Montessori as a full program; the remaining four studies were short-term add-ons to otherwise traditional school curricula.
The effect sizes for academic outcomes are similar to those obtained in other studies that compared "No Excuses" charter schools to business-as-usual urban schools.
The magnitude of Montessori education's nonacademic effects was slightly stronger than its effects on academic outcomes.
Montessori students performed about 1/3 of a standard deviation higher than students in traditional education on nonacademic outcomes, including self-regulation (executive function), well-being at school, social skills, and creativity.
The magnitude of Montessori education's effects was greater for randomized than non-randomized study designs, greater for preschool and elementary school than for middle and high school, and greater for private Montessori compared to public Montessori settings.
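One way to read the "1/4" and "1/3 of a standard deviation" figures above is to convert them into percentiles. The following is a minimal sketch, assuming that outcome scores are normally distributed with equal variance in both groups (an assumption of this illustration, not a claim made by the review); the function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentile_of_average_treated(effect_size: float) -> float:
    """Percentile of the comparison-group distribution at which the average
    treatment-group student would fall, given a standardized mean difference.

    Assumes normal scores with equal variance in both groups (illustrative only).
    """
    return 100.0 * normal_cdf(effect_size)


# Composite effect sizes reported in the review.
academic_percentile = percentile_of_average_treated(0.24)
nonacademic_percentile = percentile_of_average_treated(0.33)

print(f"Academic composite (g = 0.24): about the {academic_percentile:.0f}th percentile")
print(f"Nonacademic composite (g = 0.33): about the {nonacademic_percentile:.0f}th percentile")
```

Under these assumptions, the average Montessori student would score at roughly the 59th percentile of the traditional-education distribution on academic outcomes and roughly the 63rd percentile on nonacademic outcomes, compared with the 50th percentile for no effect.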

| What do the findings of this review mean?
Across a wide range of implementations (likely reflecting the range of Montessori implementations in the real world) and in studies of moderate to high quality, Montessori education has a nontrivial impact on children's academic and nonacademic outcomes.
1.1.5 | How up to date is this review?
The review authors searched for studies published through February 2020.

| BACKGROUND
Montessori education is the most widespread alternative education, yet its effectiveness is uncertain. Here we describe the problem addressed by the review, then discuss the Montessori intervention and why it might be effective. Finally, we discuss why it is important to do this review.

| Description of the condition
Dissatisfaction with education has been longstanding, with half or fewer American parents satisfied with K-12 education in the United States over the last 30 years. The United States as a whole performs poorly on international tests like the PISA, and national tests like the NAEP show little progress over the years and a severe drop in performance with the COVID-19 pandemic. Achievement is particularly concerning among lower-income children and children of color (Duncan, 2014). A prominent education scholar recently stated, "Preparing all students to meet higher academic standards will require instruction that is different and much better than the instruction that most students receive today" (Duncan, 2014, p. 141). There is also concern about nonacademic outcomes of children, including how to increase their executive function and social-emotional skills (Ahmed, 2021; Jones, 2015). Given that education, as it is usually implemented, has not yielded sufficiently positive outcomes to lead to widespread satisfaction, a question arises as to how alternative forms of education fare in terms of delivered outcomes. One such alternative is Montessori education.
Although Montessori education is the oldest continuously implemented, as well as the most widely implemented, alternative education in the world, currently used in over 550 public and 3000 private schools in the United States alone, evidence of its outcomes has not been rigorously compiled. Schooling techniques should be based on evidence of what works; this is why the Department of Education established the What Works Clearinghouse. And yet that Clearinghouse's current (yet outdated) entry for Montessori Method states that as of December 2005, it is unable to draw evidence-based conclusions about the effectiveness of Montessori. Recent reviews also note that its effectiveness has not been clearly established (Ackerman, 2019; Marshall, 2017). This is a problem, given that parents and school districts want to and should make schooling decisions based on evidence.
The main objective of this review is to determine if Montessori education impacts academic and/or nonacademic outcomes of children, and, thus, whether it should be further explored as a possible type of school reform to address the shortcomings of traditional education.
Secondary objectives are to determine if its impacts vary at different ages, in public versus private school settings, with the duration of a child's participation in Montessori, and across cultures (from India to Europe to America).
Montessori is currently available in over 150 countries.
Through observation, Montessori developed a distinct philosophy of education that was rooted in her medical training (Trabalzini, 2011). She viewed children as biological organisms driven toward their ultimate adult state by internal forces (Montessori, 2012). When nothing perturbs this development (akin to poor nutrition or other environmental disturbances), she believed children would make optimal choices to propel their own development forward. Thus, in Montessori programs children choose which materials to use at any given time; teacher guidance is given only as needed (i.e., when a child's choice, left on their own, is not constructive). Because children have a natural inclination to learn and develop, extrinsic motivators are rendered unnecessary and are not employed in Montessori programs. The teacher is able to work with children individually because the Montessori materials themselves do the teaching: they are self-correcting. And, aligning with children's social tendencies, children are allowed to work together as much as they wish. The only requirement is that children are constructive and that they work through all the materials in a classroom environment during the years when they are in the classroom.
The Montessori system has three elements: the environment, the teacher or guide, and the child (Lillard, 2019a; Lillard, 2019b). The system is adapted for different developmental stages (0-3, 3-6, 6-12, and 12-18) and cultures (Montessori, 2012). Yet, because children are biologically the same everywhere and have been for many thousands of years (essentially speaking), the system is highly consistent, such that today's Montessori classroom in Kyoto looks very much the same as one in the highlands of Bhutan, the slums of Mexico City, following nomadic tribes in Kenya, or in 1910 in Rome. Ideally, Montessori classrooms have ample natural light and access to nature (plants and animals in the room, and/or easy access to an outdoor space). Within carefully prepared classroom environments, children encounter an array of brightly colored hands-on materials, one of each type, arranged neatly on accessible shelves into classroom areas (Math, Language, Music, Art, Sensorial Activities, and so on). Each material is available to every child once they have been taught to use it.
Material sets increase in difficulty, serving the youngest to the oldest children in each classroom.
The teacher's role in Montessori is to connect children to the environment by showing them (individually or in small groups) how to use the materials, at a time when each child is judged to be ripe to learn them (Elkind, 2003; Montessori, 1964; Montessori, 2012; Murray, 2010). The teacher spends a great deal of time simply observing the children, judging their ripeness, and figuring out when and how to stoke a child's interest. Teachers undergo a long preparation for their roles, learning about the Montessori philosophy and theory, the subject matter of the classroom, and how to present each material in what is deemed to be a clear and captivating way, as well as developing sensitivity to how children express readiness to learn. Montessori teachers keep records of each child's progress through the sequences of materials, overseeing their learning.
Although often thought of as a private school model for preschool, Montessori actually goes through high school and is also implemented in the public sector. Most of the over 500 public Montessori schools in the United States are Title 1 schools serving children of color (Debs, 2019). Initiatives like Educateurs sans Frontières are increasingly bringing Montessori to the global majority.

| How the intervention might work
Montessori education includes philosophical and structural elements; the structural elements were judged by Maria Montessori to be the best way to implement her philosophy. The intervention might work through either or both of these avenues.
Nine philosophical elements of Montessori were described (along with supporting research) in Montessori: The Science Behind the Genius (Lillard, 2017). We briefly summarize these here, as important means through which the intervention might work. For an alternative model of how the intervention might work, see the logic model developed by Culclasure and colleagues.
1. Montessori education involves hands-on learning; cognition and movement are therefore deeply aligned. Children use materials that convey important concepts and skills that might transfer to improved academic outcomes.
2. Children in Montessori get to pursue what they are interested in learning at the moment, rather than something a teacher (or state legislature) has chosen for an entire class to learn at once, at a particular moment in time. Interest enhances learning, and being able to do what one is interested in doing also could lead to better nonacademic outcomes, such as positive feelings about or wellbeing in school (Ryan, 2000).
3. Children in Montessori programs choose what they will learn about; they determine how they will spend their time. Research has shown that when children are in environments with more self-determination, their academic performance improves (Cordova, 1996; De Charms, 1976). In addition, so does their perceived self-worth, mastery orientation (Ryan, 1986), and creativity (Amabile, 1984).
4. Montessori also places a high priority on concentrated attention and developing executive function (see Diamond, 2011). Enhanced self-regulation early in life predicts a wide range of health-related and wealth-related outcomes later in life (Moffitt, 2011).
5. Learning stems from intrinsic motivation; there are no extrinsic motivators encouraging children to work in Montessori classrooms. Intrinsic motivation is desirable in itself, and is also associated with lifelong learning (Cordova, 1996; De Charms, 1976; Ryan, 1986). Creativity is also enhanced by a lack of extrinsic rewards (Amabile, 1984).
6. Montessori learning is situated, so a child who is interested in bugs will study actual bugs, not just read about them in texts. In other cases, Montessori's specially-developed hands-on materials make learning situated; e.g., a child learning the Pythagorean theorem gets materials that make the theorem self-evident. The children also embody the theorem in the schoolyard, using ropes to measure triangles, imagining themselves as ancient Egyptians measuring property lines. Learning is enhanced when it is situated in contexts (Cordova, 1996; Lillard, 2017).
7. In Montessori classrooms, children are able to work with peers at will; they learn through imitation, through collaboration, and through peer tutoring. Peers can inspire children to assimilate and accommodate academic and social skills exhibited by those peers (Turner, 1992). Many studies show that peer learning is associated with better outcomes (Topping, 2005).
8. Montessori teachers are counseled to work with children in specific ways, to cultivate the sensitive responsiveness that leads to secure attachment, and to take an authoritative approach; such approaches predict better child outcomes (Baumrind, 1989). Montessori teachers also facilitate student-driven creative approaches to solving problems (Ultanir, 2012).
9. The Montessori environment is tightly ordered, with everything in its place. Although children have considerable freedom about how to use their time, exactly how a child uses each material is far from random; there are a series of prescribed steps, from which children are permitted to deviate only when the teacher perceives that deviation to be constructive for their learning and development. Order is also associated with better academic and nonacademic outcomes for children (Lillard, 2017).
An educational environment that embodies any or all of these philosophical elements might be expected to result in better outcomes since individual research studies involving each element individually have resulted in better outcomes. Well-implemented Montessori has all nine of these elements, and thus might improve developmental outcomes.
Over her lifetime, as she developed these philosophical elements, Maria Montessori also arrived at a specific pedagogical structure that she believed was optimal for delivering the pedagogy. When the Montessori structure within which these philosophical elements are intended to be embedded is also included in the intervention, it might further enhance outcomes. These structural elements are listed below (Lillard, 2019a; Lillard, 2019b).
1. Teachers who are well-trained to carry out the intervention, who have learned to be sensitively responsive and authoritative in their implementation of the philosophy, have learned to deliver well the full set of Montessori lessons for the age group they are teaching and know how to tend to the carefully prepared Montessori environment. Thorough Montessori teacher training takes a year and comes from teacher trainers who have undergone an extensive decade-long preparation to convey the philosophy and approach to others.
2. Classrooms that have children of specific 3-year-age spans that are thought to embrace developmental stages (thus all the children in the classroom need particular materials and lessons) and are also thought to be particularly conducive to peer learning.

3. A full set of specially-designed Montessori materials that enables hands-on learning and embodied cognition. These materials might also enhance interest, situate cognition, and evoke a sense of order.
4. A 2.5-3 h work period in the morning and afternoon, during which children exercise free choice and concentrate deeply, might assist in the development of executive function.
5. A classroom composition in which there are few adults (and only one trained teacher) and many children, to allow for peer learning and self-determination. Montessori's ideal ratio was about 1:35, with possibly a nonteaching assistant for young children.
These structural elements are part of what is typically considered high-fidelity Montessori, but not all Montessori interventions include them. By contrast, the philosophical elements are likely to be in any intervention that is designated as Montessori.
Although we have described the ideal Montessori intervention based on descriptions in Montessori's books, implementation varies in the real world (Daoust, 2004; Daoust, 2018; Daoust, 2019). Likewise, implementation would be expected to vary in research studies. The intervention being assessed here is Montessori as it appears in the body of research that purports to study it, which reflects the variation of Montessori in the real world. The actual implementations described in the included studies are covered in the Types of interventions section.

| Why it is important to do this review
It is important to do this review because better evidence is needed for policymakers and parents to know if the Montessori system produces better or worse outcomes than business-as-usual approaches; no prior review has provided definitive evidence. We found only one quantitative meta-analysis of Montessori education and it included only two studies of Montessori (Borman, 2003). There have been several narrative reviews related to Montessori education's impact on academic and nonacademic outcomes, and they have focused especially on preschool-aged children, termed primary in Montessori circles (e.g., Ackerman, 2019; Boehnlein, 1988; Marshall, 2017; Murray, 2010). Although thorough, Jones failed to take a systematic approach to searching the literature and did not quantitatively synthesize the research data. Boehnlein (1988) reviewed the Montessori education research that may be of interest to public schools, summarizing the results as follows:
• Early research provides evidence that the Montessori method and environment are beneficial to low- and middle-SES children.
• Current research corroborates the early findings, in particular, the importance of the Montessori preschool experience.
• Of specific importance for best results long-term are the full 3-year preschool program, trained Montessori teachers, and multi-age grouping (Boehnlein, 1988, p. 476).
Although relatively thorough, the Boehnlein reviews were narrative reviews without a systematic search strategy or quantitative synthesis. The same is true of recent narrative reviews by Ackerman (2019) and Marshall (2017). In sum, the narrative reviews of Montessori education lack the systematic quality and rigor afforded by a Campbell Collaboration review, and a systematic review of all the existing literature is needed to resolve the question of whether Montessori has an impact on child outcomes.

| OBJECTIVES
The primary objective of this review was to examine the effectiveness of Montessori education, compared to traditional education, in improving academic and nonacademic outcomes for prekindergarten to high-school-aged students. The secondary objectives were to determine which of the following factors moderate the reported effectiveness of Montessori education: grade level, public versus private Montessori settings, type of assignment to experimental and contrast conditions, treatment duration, and length of follow-up measurements. The Campbell protocol for this review can be found in Randolph (2016).
In this subsection, we describe the inclusion and exclusion criteria based on study design features.
• We included studies that used group experimental (i.e., with random assignment) and/or quasi-experimental (i.e., without random assignment) research designs.
• We included pretest-posttest with control group designs, posttest-only with control group designs, and designs with case-control matching on a measure of the same construct as the outcome construct.
• We excluded pretest-posttest without control group designs.
Studies that used single-participant, correlational, quantitative descriptive, or qualitative designs were excluded. The portions of mixed-methods studies that met study criteria were included and the other information was excluded.
• We excluded experimental/quasi-experimental studies if they did not meet the following What Works Clearinghouse (2014) study quality standards. For experimental and quasi-experimental designs, those standards are listed below:
o Group membership was determined through a random process, or
o Equivalence was established at the baseline for the groups in the analytic sample.
Because of the potential for selection bias, we used strict criteria for establishing baseline equivalency in studies not using random assignment.
Quasi-experimental studies had to meet at least one of the following criteria to be considered for inclusion:
• The authors used covariate-adjusted means where at least one covariate was a measure of the same construct as the outcome.
For example, we would consider a study with a mathematics outcome to have baseline equivalency if the authors adjusted for mathematics pretest scores. However, we did NOT consider a study as having baseline equivalency if it only adjusted for covariates that are correlated with the outcome. For example, we would not consider a study with a mathematics outcome to have baseline equivalency if it only adjusted for family income, although family income is known to correlate with academic achievement (Duncan, 2014). In short, covariates had to measure the same construct as the outcome for the study to be considered to have baseline equivalency based on covariate adjustments.
• The authors matched participants based on a covariate that measures the same construct as the outcome construct.
• The authors used gain scores to establish baseline equivalency.
The pretest and posttest scores had to use the same measure or an equatable, scaled measure.
• The authors provided evidence that there was not a statistically significant difference in pretest scores between Montessori education and traditional education groups.
• We excluded studies in which the author did not report enough information to compute standardized mean difference effect sizes.
We did not include studies that required us to impute means and standard deviations from medians and ranges/interquartile ranges.
For studies published since 2000, we attempted to contact authors to get this information if it was not reported in the study and kept documentation of that information in the inclusion/exclusion data set provided in Randolph (2021).
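To illustrate what "enough information to compute standardized mean difference effect sizes" involves, the following is a minimal sketch of Hedges' g and its sampling variance computed from group means, standard deviations, and sample sizes. It uses the common textbook formulas (pooled standard deviation with a small-sample correction factor); the review does not spell out its exact formulas here, so this is an assumption of the illustration, and the function name and input values are hypothetical.

```python
import math
from typing import Tuple


def hedges_g(mean_t: float, sd_t: float, n_t: int,
             mean_c: float, sd_c: float, n_c: int) -> Tuple[float, float]:
    """Return (g, variance of g) for two independent groups.

    Positive values favor the treatment group (here, Montessori).
    Textbook formulas; not necessarily those used by the review authors.
    """
    df = n_t + n_c - 2
    # Pooled within-group standard deviation.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    # Uncorrected standardized mean difference (Cohen's d).
    d = (mean_t - mean_c) / pooled_sd
    # Small-sample correction factor.
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    # Approximate sampling variance of d, then of g.
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2.0 * (n_t + n_c))
    return j * d, j ** 2 * var_d


# Hypothetical example values, not taken from any included study.
g, var_g = hedges_g(mean_t=105.0, sd_t=15.0, n_t=50,
                    mean_c=100.0, sd_c=15.0, n_c=50)
print(f"g = {g:.3f}, variance = {var_g:.4f}")
```

A study that reports only medians and ranges does not supply the means and standard deviations this calculation requires, which is why such studies were excluded rather than imputed.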
Studies were included in the current review if they described their intervention condition as Montessori. Not every study specified the elements of Montessori that were implemented, and those that did described them to varying degrees. The descriptions provided yielded implementations ranging from full (well-aligned with what is described in Montessori's books; 10 studies) to weak (four that were merely add-ons to otherwise traditional programs). This range reflects the wide range of implementation of programs called Montessori in the real world: the term is not trademarked. To be objective, the label alone was used to determine study inclusion. In the Types of Interventions section, we report an analysis of the range of implementations used in the included studies; we also note our grade of the implementation quality in the Intervention characteristics column of Table 1.
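The design, baseline-equivalence, and labeling criteria described in this section can be summarized as a simple screening rule. The sketch below is a simplified illustration only: the record fields and function names are hypothetical, the design criteria are collapsed into a single control-group flag, and the actual screening was performed by the review authors using their coding book.

```python
from dataclasses import dataclass


@dataclass
class StudyRecord:
    """Hypothetical coding of one candidate study."""
    labeled_montessori: bool          # authors describe the intervention as Montessori
    has_control_group: bool           # traditional-education comparison group
    random_assignment: bool           # group membership determined by a random process
    adjusted_for_same_construct_covariate: bool = False
    matched_on_same_construct: bool = False
    used_gain_scores_same_measure: bool = False
    no_significant_pretest_difference: bool = False
    effect_size_computable: bool = True


def has_baseline_equivalence(study: StudyRecord) -> bool:
    """Random assignment, or at least one quasi-experimental criterion."""
    if study.random_assignment:
        return True
    return any([
        study.adjusted_for_same_construct_covariate,
        study.matched_on_same_construct,
        study.used_gain_scores_same_measure,
        study.no_significant_pretest_difference,
    ])


def is_eligible(study: StudyRecord) -> bool:
    """Apply all screening criteria from this section."""
    return (study.labeled_montessori
            and study.has_control_group
            and has_baseline_equivalence(study)
            and study.effect_size_computable)


# A randomized study, a study adjusting only for family income, and a
# study adjusting for a pretest of the same construct as the outcome.
randomized = StudyRecord(True, True, True)
income_only = StudyRecord(True, True, False)
pretest_adjusted = StudyRecord(True, True, False,
                               adjusted_for_same_construct_covariate=True)

print(is_eligible(randomized), is_eligible(income_only), is_eligible(pretest_adjusted))
```

The income-only study fails because adjusting for a covariate that merely correlates with the outcome does not count as baseline equivalence under the criteria above.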

| Types of participants
We included studies in which the participants were in preschool, elementary, middle school/junior high school, and/or high school. When a study had participants in two or more of these groups, we classified the age group for the study as the age group with the greatest number of participants. See the coding book in the supplemental information (Randolph, 2021). The "location, setting, status, or definition of the condition and demographic factors" (Campbell Collaboration, 2019b, p. 8) were not considered as inclusion or exclusion factors. We created some exploratory emergently-coded variables for demographic factors (e.g., gender, family income, etc.) and country of origin, but those factors were not included as moderators in this analysis. See the Data extraction and management section for more details on the demographic information collected.
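The rule for studies spanning more than one age group (classify by the group with the most participants) can be sketched as follows. The function name and the example counts are hypothetical, and the source does not say how ties were handled; this sketch simply returns the first group listed among those tied, which is an assumption of the illustration.

```python
from typing import Dict


def classify_age_group(participant_counts: Dict[str, int]) -> str:
    """Return the age group with the greatest number of participants.

    Ties resolve to the first tied group in insertion order (an assumption;
    the review does not describe tie handling).
    """
    if not participant_counts:
        raise ValueError("at least one age group is required")
    return max(participant_counts, key=lambda group: participant_counts[group])


# Hypothetical study with participants in two age groups.
study_counts = {"preschool": 40, "elementary": 25}
print(classify_age_group(study_counts))
```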

| Types of interventions
The intervention was defined broadly as Montessori education. We operationalized Montessori education as an intervention in which the study authors claimed to have used the Montessori method of education; this occurred in both public and private school settings. In all studies, Montessori education was a separate measurable intervention. Traditional or business-as-usual education was the comparison condition for all studies in the analysis. Next, we describe what these interventions were in the included studies; they align with the range of implementations of Montessori and traditional education in the real world, given that neither term is trademarked.

Montessori interventions
Although the included studies designated the intervention condition as "Montessori," the studies' methods sections provide varying levels of description of the intervention. Most of those descriptions indicated that the intervention had the philosophical elements of Montessori education, but some revealed deviations from the structural ones. Because of this variation, and in response to the first set of reviews, we categorized (post hoc) the Montessori conditions into five categories. All of the Montessori interventions appeared sensitive to the philosophical elements of Montessori, for example touching on free choice and hands-on materials in the article's Introduction if not also in its Methods. Where they varied most was in the structural elements.
Although we performed this categorization, we do not advise considering how effect sizes might vary with the implementation levels because effect sizes stem from many sources including the different effectiveness of the Montessori intervention relative to its control condition; the control or traditional conditions also varied, as they do in the real world. A second reason not to consider these levels for analysis is that the variables that led to the categorizations are our best estimates based on what was reported; they lack precision. In sum, we provide the levels and their descriptions here only to give readers a sense of the range of implementations that were considered in the meta-analysis.
At the highest level was full implementation, meaning that the school was recognized by a respected association like the Association Montessori Internationale (AMI) or the Swiss Montessori Association or at least had teachers who were fully trained by AMI or the American Montessori Society before the intervention took place. AMI recognition entails the structural elements of having AMI-trained teachers (trained to implement the philosophy to a high degree), a specific 3-year-age range in each classroom (e.g., children ages three to six or six to nine), a 2.5-3 h uninterrupted work period during which children are free to choose their own work, large class sizes and few adults, and a full set of Montessori materials. There are also no grades; the materials are self-correcting. The Montessori condition in 10 of the 32 studies appeared to meet these criteria (Denervaud, 2019, 2020; Elben, 2015; Lillard, 2006; Lillard, 2012; Lillard, 2017; Mix, 2017; Rathunde, 2005a; Rathunde, 2005b; Yussen, 1980). The next highest level, observed in six studies, was medium implementation; for these, the article mentions some accreditation or teacher training, but the accrediting organization was unspecified ("a local Montessori association," Prendergast, 1969), or teacher training was done by a respected organization like AMS but not all teachers had completed the training at the time of data collection (Mallett, 2015). In one case the description of Montessori suggested good implementation, but photos included in the article incorrectly called some commercial toys "Montessori materials" (Faryadi, 2017), which put the study in a lower category. In another case, implementation was discussed and a rubric was developed to measure it, and the measure indicated a nontrivial level of deviance from the highest level of implementation among the schools studied (Culclasure, 2018).
In another case, the UK Montessori Schools Association had accredited the Montessori schools, but they deviated by including pretend play; this may not be a major deviation but did suggest not fully implementing Montessori (Kirkham, 2017). One study considered to have a Montessori condition at this medium level had AMS-trained teachers or teachers in training but used only two ages of children per class instead of three (Manner, 1999).
At the next level, observed in seven studies, Montessori was claimed to be implemented as a full classroom program, but there was [...] 4-year-olds (Ansari, 2014; Galindo, 2014; Jones, 1979; Miller, 1983; Miller, 1984); three of these also had teachers with only a 6-week training course. For one study, it was stated that some classrooms had mixed grades and it specified two grades, suggesting some other classrooms had only single ages (Besançon, 2013). Another study at this level called the intervention modified Montessori and stated that some children had group instruction during part of the session (Coyle, 1968; description taken from Concannon, 1966). These all suggest some serious deviations from high-fidelity Montessori.
There were five studies in which the level of Montessori implementation could not be determined: Fleege (1967), Kayili (2016a), Kayili (2016b), Tobin (2015), and Aydoğan (2016). In these studies, Montessori philosophy was described appropriately, but there was not enough information about the classrooms themselves to determine whether the structural elements were in place; one was left with very little sense of how Montessori was actually implemented. In the fifth study in this category, not only was the Montessori implementation unspecified, but the intervention appeared to occur over a 7-week period (Aydoğan, 2016); it is unclear if children had had Montessori programming before the 7-week-long study. [...] sessions with a variety of sensorial activities and role-play activities; it appeared that these built on each other and that each session included the activities from prior sessions. From their introduction, it seems the intervention children were free to choose among these activities. The final study in this group (Juanga, 2015) implemented Montessori for a single day using assigned workstations; the environment was prepared, the activities were hands-on, and a Montessori consultant guided the teacher.
Although there is tremendous variety in these implementations, they reflect the variety of implementations of Montessori in the real world; the term is not trademarked.

[...] further details suggested these were also teacher-centered, lecture-style approaches. By contrast, for some (especially older) studies conducted in the United States, traditional education at the preschool level was described as composed of free play and pretense (Coyle, 1968; Miller, 1983; Miller, 1984; Prendergast, 1969). One study specified that its conventional condition implemented HighScope, which involves centers offering various art, play, and learning activities; this method is modeled after the Perry Preschool Project (Ansari, 2014). Others provided little to no information about their conventional preschool condition, referring simply to the control program as pre-K, traditional, business-as-usual, or non-Montessori. In recent years preschool programs in the United States are more likely to be teacher-centered and involve little play (Bassok, 2016).
In sum, traditional and Montessori conditions reflected a range, across studies, that is reflective of the range of implementations of each in the real world. Although we categorized the Montessori implementation into levels post-hoc, we advise against considering the relation between effect sizes and these estimated levels of implementation, first because our categorization is imprecise, and second because effect sizes reflect the relative difference across Montessori and control conditions in each study, and control conditions varied as well in ways that are impossible to rank precisely.

| Types of outcome measures
We included two broad categories of outcomes: academic and nonacademic. Because of the wide range of nonacademic outcomes, we used an emergent approach to arrive at the specific outcomes being measured in the Montessori literature. We considered academic outcomes to be primary measures and nonacademic outcomes to be secondary measures.

Primary outcomes
The variety of academic outcomes was reduced to five categories of outcomes:
• General academic ability. This outcome included measures that did not clearly fit one of the other categories (like math or reading), such as the cognitive subscale of the Learning Accomplishment Profile-Diagnostic (LAP-D) (see Ansari, 2014); this subscale included counting as well as matching and did not report scores separately. It also included three subtests of the Bracken scale (see Galindo, 2014) including identifying colors and comparing sizes. Tests of problem solving that were not explicitly mathematical were also categorized as General Academic Ability. Finally, different academic abilities concatenated into a general academic score were included in this category.

Individually, the academic outcomes had too few studies to do a moderator analysis with cluster-robust variance estimation; therefore, we also created an aggregated academic outcomes variable that comprised studies that contributed at least one effect size to one or more of the academic outcomes listed above.

Secondary outcomes
Although the nonacademic outcomes are listed as secondary outcomes here, we do not intend to assign a hierarchical structure to academic and nonacademic outcomes. After examining the variety of nonacademic outcomes, we found that there were four major nonacademic outcomes:
• Creativity. This included measures like Alternate Uses, where children must come up with all possible uses for a common object like a paper clip, the Torrance Test of Creativity (which includes Alternate Uses as one of several tasks), and tests where children need to create a drawing or story and a panel of judges rates their creativity.
• Executive function. Executive function was measured directly with a wide variety of tasks and also indirectly with parent or teacher report questionnaires. The direct tasks include, for example, the Simon-Says-like game "Head Toes Knees Shoulders", the Flanker task, and reciting a string of digits backward. An example of a teacher- or parent-report measure of executive function is the Behavior Rating Inventory of Executive Function or BRIEF.
• Inner experience of school. Some studies addressed how children experience school, asking how much they like school on a survey or using the experience sampling method, whereby pagers randomly beep children and ask them to rate their immediate emotional and cognitive experience; in such studies, the in-school data were used from children attending different types of schools.
• Social skills. Included in this category were tests assessing children's social knowledge, including tests of basic social cognition (like tests of emotion recognition and the Theory of Mind scale) and tests assessing knowledge about managing peer relations (e.g., asking how one would respond to a social conflict, as with Rubin's Social Problem Solving test). Also included were teacher and parent ratings of children's social behavior (such as the Devereux Early Childhood Assessment), and live coding of social behavior (e.g., ambiguous rough and tumble play on a playground).
Individually, the nonacademic outcomes had too few studies to do a moderator analysis with cluster-robust variance estimation; therefore, we created an aggregated nonacademic outcomes variable from studies that contributed at least one effect size to one or more of the nonacademic outcomes listed above.

Articles and gray literature were gathered using online databases that cover education, sociology, and psychology, and recommendations from experts in the field of Montessori education. Only results written in English were considered for inclusion, although no language limiters were utilized in the searches. We employed both free-text and controlled vocabulary terms in the searches. All permutations of search terms were used during the search process.
We complemented our search with a thorough examination of reference lists of relevant retrieved studies, both included and excluded, and contacted experts in the field to identify any ongoing or unpublished studies.
Studies were identified using the following electronic databases and online sources:
• Academic Search Complete (EBSCO)
• AERA Online Paper Repository
• American Montessori Society Montessori Research Library
• Arts & Humanities Citation Index (Web of Science)
• Dissertations & Theses Global (ProQuest)
• Education Full-Text/Education Research Complete (EBSCO)
• Education Journals (ProQuest)
• Professional Development Collection (EBSCO)
• Research Library (ProQuest)
• Social Sciences Citation Index (Web of Science)
• Social Sciences Journals (formerly Social Science Database) (ProQuest)
• SocINDEX with Full Text (EBSCO)
• Sociological Collection (EBSCO)
• Teacher Reference Center (EBSCO)

During the search process, we utilized phrase searching and truncation methods to find all variations of relevant search terms.
Database thesauri, when available, were used to find controlled vocabulary descriptors and related descriptors which were integrated into search iterations.
The search strategy was customized as needed for each database and was changed to include Montessori classrooms of all grade levels. Details on the search strategy for each source are provided in the Appendix. Search strategies were tailored to the unique controlled vocabularies of each database and were used in conjunction with free text search terms, which can be found in Supporting Information: Appendix 1. The search strategy included keywords and subject headings pertaining to setting (school), intervention (Montessori), and outcome (academic and nonacademic). The number of search results from each source is provided in Table 2.
There were limitations related to this search. Namely, it omitted studies published after February 2020 and was restricted to studies written in English.

| Electronic searches
Databases
A comprehensive database search included the following online subscription databases:
• EBSCO Academic Search Complete

Contacting other researchers
Authors of prior studies and other experts in the Montessori method were contacted to obtain unpublished research or to get further clarification on published studies. A record of attempts to contact authors can be found in the notes section of the inclusion and exclusion data set provided in the supplemental information to this review (Randolph, 2021).

Prior reviews and reference lists
The prior reviews that were searched are mentioned in the Why it is important to do this review section and there is additional information in the supplemental information provided in Randolph (2021).
Following the suggestions of the American Statistical Association (Wasserstein, 2019), we refrained from null-hypothesis statistical significance testing when possible. However, for readers interested in interpreting our results in the null-hypothesis testing paradigm, we provide Benjamini-Hochberg-adjusted critical alpha values (Benjamini, 1995). These adjusted critical values and confidence intervals are meant to keep the false discovery rate and false coverage rates, respectively, at an overall 0.05 level. We report the adjusted values separately for primary objectives with 11 main-effects estimates in the supplemental information (Randolph, 2021).
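As an illustration of how Benjamini-Hochberg critical values are formed, the following minimal Python sketch computes the critical value (i / m) × q for each rank i among m estimates and flags which p values would be rejected. It is not the review's own code (which is in R and available in the supplement); the function names `bh_critical_values` and `bh_reject` and the example p values are hypothetical.

```python
def bh_critical_values(m, q=0.05):
    """Return the BH critical value (i / m) * q for ranks i = 1..m."""
    return [(i / m) * q for i in range(1, m + 1)]


def bh_reject(p_values, q=0.05):
    """Return booleans in input order: True where the BH step-up rule rejects."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda idx: p_values[idx])
    crit = bh_critical_values(m, q)
    # Find the largest rank k whose ordered p value is at or below its critical value.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= crit[rank - 1]:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject


if __name__ == "__main__":
    # Critical values for a family of 11 estimates at a 0.05 false discovery rate.
    for rank, value in enumerate(bh_critical_values(11), start=1):
        print(rank, round(value, 5))
    # Hypothetical p values, for illustration only.
    print(bh_reject([0.001, 0.04, 0.03, 0.5]))
```

With 11 estimates, the smallest p value is compared against 0.05/11 and the largest against 0.05, so the threshold relaxes as the rank increases.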

| Selection of studies
After the information retrieval expert for this review initially identified potential studies for inclusion based on titles and abstracts, at least two reviewers independently reviewed the full text and/or abstracts of the study to make decisions about study inclusion and exclusion. Disagreements were resolved by consensus.

| Data extraction and management
The data specified on the coding sheet were extracted independently by at least two reviewers and a consensus decision was reached when there was disagreement.
Our coding sheet was divided into the following categories: We did not include the risk of bias variables related to blinding because blinding was not possible in the educational studies we reviewed. We assumed a high risk of blinding bias for all studies; therefore, we do not have the risk of bias codes related to blinding.
We expected quasi-experimental studies, so we intended to measure what methods and confounding variables the study authors controlled for.
The list below indicates the main variables used in the coding sheet to extract data. The revised coding sheet can be found in the supplemental information provided in Randolph (2021). This list may differ slightly from the one in the protocol because we used an emergent coding approach to create some variables. We created some emergent codes when we were unsure what categories we would find.

| Measures of treatment effect
We used a standardized mean difference effect size (Hedges' g) as the measure of treatment effect. The effect size was calculated in one of three ways:
• using the standardized mean difference approach in the metafor package (Viechtbauer, 2010) in R (R Core Team, 2019),
• using the standardized mean change in raw score approach in the metafor package in R,
• or using an "other" approach with Wilson's (Wilson, n.d.) [...]
For studies using the posttest-only with control group design, we used the escalc function (measure = "SMD") of the metafor package in R to calculate the effect size and its unbiased estimate of the sampling variance (vtype = "UB"). For pretest-posttest designs with control groups and covariate adjustments, we used the same method described above using covariate-adjusted means.
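To make the posttest-only calculation concrete, the sketch below computes Hedges' g from group means, standard deviations, and sample sizes. It is a hedged Python illustration rather than the metafor code used in the review: the function name `hedges_g` is hypothetical, and the variance shown is the common large-sample approximation, not the unbiased estimator selected by vtype = "UB".

```python
import math


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Return (g, var_g) for two independent groups.

    g is the standardized mean difference with the exact small-sample
    correction; var_g is the large-sample approximation of its variance
    (an assumption of this sketch, not the "UB" estimator).
    """
    df = n1 + n2 - 2
    # Pooled standard deviation across the two groups.
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / sd_pooled
    # Exact correction factor J = Gamma(df/2) / (sqrt(df/2) * Gamma((df-1)/2)),
    # computed on the log scale for numerical stability.
    j = math.exp(math.lgamma(df / 2) - math.lgamma((df - 1) / 2)) / math.sqrt(df / 2)
    g = j * d
    var_g = 1 / n1 + 1 / n2 + g ** 2 / (2 * (n1 + n2))
    return g, var_g


if __name__ == "__main__":
    # Hypothetical summary statistics, for illustration only.
    g, var_g = hedges_g(m1=105.0, sd1=10.0, n1=20, m2=100.0, sd2=10.0, n2=20)
    print(round(g, 3), round(var_g, 4))
```

The correction factor J shrinks the raw standardized difference slightly; its effect is noticeable only in small samples.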
For pretest-posttest designs with control groups that did not use covariate adjustment or for studies that used case-control matching on a baseline measure of the outcome construct, we assumed the pretest standard deviation was an unbiased estimate of σ and used a raw mean change score approach as suggested in Morris (2002) (Equation 6). (The lack of pretest-posttest correlations in many studies was the primary reason we adopted Morris's raw mean change score approach.) Namely, we calculated the posttest-pretest mean differences for each group and used the pretest standard deviation of that group as the measure of variance. For designs with case-control matching, we used the case-control group standard deviation as the measure of variance. We then used the standardized mean change raw score parameter (measure = "SMCR") in the escalc [...] (2015). This allowed us to use every effect size while still accounting for the dependency of effect sizes within studies. Specific information on the cluster-robust approach can be found in the Data synthesis section.
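The raw mean change score approach described above can be sketched as follows: each group's posttest-minus-pretest difference is divided by that group's pretest standard deviation, and the control group's standardized change is subtracted from the treatment group's. This Python sketch is an illustration under stated assumptions, not the review's R code; the function names are hypothetical, a small-sample correction with n − 1 degrees of freedom is assumed, and the sampling variance is omitted because it depends on the pretest-posttest correlation, which many studies did not report.

```python
import math


def small_sample_j(df):
    """Exact small-sample correction factor for a standardized mean difference."""
    return math.exp(math.lgamma(df / 2) - math.lgamma((df - 1) / 2)) / math.sqrt(df / 2)


def standardized_mean_change(m_pre, m_post, sd_pre, n):
    """Standardized mean change for one group, using the pretest SD as the standardizer."""
    return small_sample_j(n - 1) * (m_post - m_pre) / sd_pre


def change_score_effect(treatment, control):
    """Difference between the treatment and control standardized mean changes.

    Each argument is a dict with keys m_pre, m_post, sd_pre, and n.
    """
    return standardized_mean_change(**treatment) - standardized_mean_change(**control)


if __name__ == "__main__":
    # Hypothetical summary statistics, for illustration only.
    montessori = dict(m_pre=50.0, m_post=60.0, sd_pre=10.0, n=30)
    traditional = dict(m_pre=50.0, m_post=55.0, sd_pre=10.0, n=30)
    print(round(change_score_effect(montessori, traditional), 3))
```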

| Dealing with missing data
If outcome data were missing, we attempted to contact the study author(s) to get access to the missing data. If the author(s) never responded to our query, we excluded the study from the analysis. See the supplemental information (Randolph, 2021) for a record of which studies were excluded because of missing data and for documentation of attempts to contact study authors. In some cases, the authors sent the original data set and we used that data to generate the data missing in the study itself. A record of these instances is recorded in the data set and spreadsheet of included/excluded studies in the supplemental information as well.

| Assessment of heterogeneity
We used several methods to address study heterogeneity largely following the methods in Higgins (2021). For each outcome, we examined a forest plot containing the effect size estimate and its 95% confidence interval for each study and the weighted mean effect size and its 95% confidence interval. For outcomes with too many effect sizes to visualize in a forest plot with R software, we created a graphic display of study heterogeneity (i.e., a GOSH plot) (Olkin, 2012).
Although not included here, we also examined Baujat plots, radial plots, residual plots, and various other diagnostic and leverage tables included in the metafor package to identify studies with atypical heterogeneity. Funnel plots were examined for outcomes with 10 or more studies. We investigated kernel density plots of unweighted effect sizes to examine the distributional characteristics of the individual outcomes. In addition to the visual analysis of heterogeneity, we also examined several statistical measures of heterogeneity: the value of Q, its df, and its related p value, and the value of the I² statistic. The R code and data set for these analyses can be found in Randolph (2021).
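For readers unfamiliar with these statistics, the sketch below shows one common way Q, its df, and I² are computed from a set of effect sizes and their sampling variances. It is a hedged Python illustration with a hypothetical function name, not the review's R code; it uses the formula I² = max(0, (Q − df)/Q) × 100, and meta-analysis software may compute I² differently (for example, from an estimate of between-study variance). The p value for Q, which requires a chi-square distribution function, is omitted.

```python
def heterogeneity(effects, variances):
    """Return (Q, df, I2) using inverse-variance weights.

    Q is the weighted sum of squared deviations from the fixed-effect
    pooled mean; I2 is the percentage of Q in excess of its df.
    """
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


if __name__ == "__main__":
    # Hypothetical effect sizes and variances, for illustration only.
    q, df, i2 = heterogeneity([0.10, 0.45, 0.30, 0.80], [0.02, 0.03, 0.025, 0.04])
    print(round(q, 2), df, round(i2, 1))
```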

| Assessment of reporting biases
We used two methods to examine the presence and magnitude of publication bias: funnel plots and a trim and fill analysis. First, as suggested in Higgins (2021), we assessed the degree of potential publication bias by visually analyzing funnel plots for outcomes with more than 10 studies; we did not carry out null-hypothesis-based statistical significance tests of publication bias.
Second, we used a trim and fill method as an additional tool to [...] Specifically, we used the trimfill() function, which is based on the work of Duval (2000), in R's metafor package (Viechtbauer, 2010) using L0, R0, and Q0 parameters and using the Sidik-Jonkman method for random-effects synthesis. A sensitivity analysis of those parameters, which is not reported here, clearly indicated that the bias was left-sided. Duval (2000) recommends using either the L0 or R0 estimate; for the sake of brevity, we only report the trim and fill results using the L0 parameter. The results did not differ substantively between L0 and R0.
As suggested in Shi (2019) [...] Bias in the selection of the reported result was assessed using the RoB 2 tool (Sterne, 2019) for studies with random assignment and the ROBINS-I tool (Sterne, 2016) for studies without randomized assignment.

| Data synthesis
For studies that reported multiple effect sizes for a single outcome, we used a cluster-robust method for synthesizing effect sizes as described in Tanner-Smith (2014), Tanner-Smith (2016), and Tipton (2015) using the robumeta package (Fisher, 2017) in R. We used Tipton (2015)'s small-sample correction and assumed the within-study effect size correlation (ρ) to be 0.80. As suggested in Tipton (2015), we investigated this assumption with a sensitivity analysis using a range of values of ρ. We also estimated that dependencies between within-study effect sizes were based more on correlations of effect sizes within studies than hierarchical effects, so we used

| Subgroup analysis and investigation of heterogeneity
For aggregated academic and aggregated nonacademic outcomes, we conducted the following a priori subgroup analyses using cluster-robust meta-regression with the robumeta package (Fisher, 2017) using the methods described in Tanner [...] The social studies outcome only had one study with multiple outcomes so a random-effects model was used. See the R code in the supplement (Randolph, 2021) for more details.
As mentioned in the protocol, we extracted information on student demographic characteristics, such as race/ethnicity, at-risk status, gifted/ talented, or measures of socioeconomic status, but we did not intend to examine these characteristics as moderators in this review. We intended to use these demographic data to richly characterize study participants, to examine whether these variables were used as covariates in study analyses, and to facilitate follow-up reviews that might examine demographic characteristics as moderators.

| Sensitivity analysis
We conducted the following sensitivity analyses. See the related R functions in the supplemental information (Randolph, 2021) for more information; the function names are given in italics below.
• We compared results between cluster-robust, random-effects, and fixed-effects models; es_calc_method()
• We examined how ρ (i.e., the correlation of within-study effect sizes) covaried with effect sizes; robust_main()
• We conducted a leave-one-out analysis using the leave1out function in the metafor package; robust_main()

| Summary of findings and assessment of the certainty of the evidence
We used the GRADE approach described in Higgins (2021) to assess the certainty of evidence and summarize findings. The GRADE approach results in one of four ordinal ratings of the certainty of evidence in an outcome: high, moderate, low, or very low.
The first step in the GRADE approach is to establish an initial level of certainty. Randomized studies or studies evaluated using the ROBINS-I tool (Sterne, 2016) for examining risk of bias in nonrandomized studies are given an initial certainty of high certainty.
Observational studies not using the ROBINS-I tool are given an initial certainty rating of low certainty. The first author independently assessed the certainty of evidence and other authors reviewed those certainty assessments. Any disagreements were resolved through consensus. The fine details of the GRADE approach used in this review are explained below.
All included studies were initially assigned a level of "high certainty" because they were all randomized trials or were evaluated with the ROBINS-I tool. We then downgraded or upgraded the level of certainty based on the various GRADE factors.
First, we considered risk of bias. We assumed all studies to have low risk of bias. We then downgraded the certainty of evidence by one or two levels based on the RoB 2 (Sterne, 2019) for studies using random assignment and the ROBINS-I tool (Sterne, 2016) for studies not using random assignment. We used the following criteria for downgrading studies based on their risk of bias: A rating of high certainty evidence can be achieved only when most evidence come from studies that meet the criteria for low risk of bias. The certainty of evidence might be downgraded by one level when most of the evidence comes from individual studies either with a crucial limitation for one item, or with some limitations for multiple items. (Higgins, 2021, p. 392) Furthermore, it was possible to downgrade two levels when there were very serious limitations defined as a "crucial limitation for one or more criteria sufficient to substantially lower their confidence of an effect" (Higgins, 2021, p. 393). See the Assessment of risk of bias in included studies section for more details on risk of bias.
After downgrading for risk of bias, we then downgraded for other GRADE factors: inconsistency, indirectness, imprecision, or publication bias and upgraded for large effects, dose-response effects, or opposing residual bias and confounding as described in Higgins (2021).
We downgraded for inconsistency if an outcome met Higgins' (2021, p. 259) definition of considerable heterogeneity: I 2 values equal to or above 75%.
We downgraded for indirectness when there were indirect comparisons or that could cause "a restricted version of the main review question in terms of population, intervention, comparator, or outcome" as described in Higgins (2021, pp. 393-394).
We downgraded for imprecision up to two levels if an outcome did not meet each of the three following criteria:
• The number of participants for that outcome was below the optimal information size. We calculated the optimal information size as the sample size needed for a study given the oft-used Cohen (1988) convention for a "small" effect size (i.e., a standardized mean difference effect size of 0.20), one predictor, α = 0.05, power (1 − β) = 0.80, and a two-sided test. In this case, the optimal information size was 387 as suggested in a sample size table from Randolph (2019).
• The 95% confidence intervals for the standardized mean difference for the outcome included 0.00. We downgraded up to two levels if the 95% CI included small positive effects and small negative effects.
For outcomes with more than 10 studies, we downgraded for publication bias if the funnel plots showed marked asymmetry.
We upgraded for large effects if the standardized mean difference effect size was greater than 0.80 in absolute value, which corresponds with Cohen (1988)'s convention for a "large" effect size in laboratory studies in the behavioral sciences.
We upgraded if there was evidence of a dose-response effect (i.e., if there was evidence from a meta-regression that treatment duration had a positive correlation with the effect size).
Finally, we upgraded for opposing residual bias and confounding if we found strong evidence that "all plausible biases from randomized or nonrandomized studies may be working to underestimate an apparent intervention" (Higgins, 2021, p. 397).
When assigning textual descriptions of the magnitude of effect sizes, we used the conventions of Cohen (1988) [...]

| Included studies
Studies with academic outcomes
• Studies with academic outcomes tended to use the pretest-posttest with control group design (Shadish, 2002). (Studies that used pretest-posttest without control group designs were excluded.)
• The majority of effect sizes of academic outcomes came from studies that used random assignment as evidence of baseline equivalency. Other common sources of evidence of baseline equivalency were nonstatistically significant differences on a pretest or using the pretest as a statistical covariate.
• Most studies with academic outcomes collected their data within one year of completion of the intervention.
• In terms of setting, most studies with academic outcomes were conducted in public settings for both the traditional and Montessori conditions.
• Standardized measures of academic achievement were the most frequently used type of measure in studies with academic outcomes.
• Studies with academic outcomes tended to be conducted in elementary or pre-K settings.
• The vast majority of studies with academic outcomes were conducted in North America.
• Most studies with academic outcomes were published in peer-reviewed publications.

Studies with nonacademic outcomes
• Similar to studies with academic outcomes, nonacademic studies most frequently used the pretest-posttest with control group design.
• In contrast to studies with academic outcomes, studies with nonacademic outcomes tended to use nonstatistically significant pretest measures and/or gain scores as evidence of baseline equivalency. Studies with nonacademic outcomes tended not to use random assignment.
• Like studies with academic outcomes, studies with nonacademic outcomes tended to collect data within one year of completion of the intervention.
• Studies with nonacademic outcomes were conducted in an approximately equal proportion of public and private settings.
• As expected, studies with nonacademic outcomes did not use standardized tests of achievement.
• Similar to studies with academic outcomes, the most frequently used settings were in elementary and pre-K.
• Most studies with nonacademic outcomes had first authors from North America or Europe.
• Most studies with nonacademic outcomes were published in peer-reviewed forums.
The list below provides a link to each of the 32 included studies.
The supplemental information in Randolph (2021) contains a data set where we extracted the information specified in the coding book; specific details on each study can be found there.

• a lack of proof of equivalency of Montessori and traditional groups at baseline (n = 58),
• did not use experimental or quasi-experimental research design (n = 38),
• insufficient information to calculate an effect size (n = 44),
• a lack of a Montessori-based intervention (n = 8),
• the absence of a traditional, control group (n = 11),
• a lack of an academic or behavioral outcome (n = 6),
• was a duplicate study (n = 3),
• or was irretrievable (n = 6).
Note that exclusion criteria were not mutually exclusive, so a study could have been excluded for one or more reasons. If a study met at least one exclusion criterion, the other exclusion criteria may not have been assessed. An online data set in the supplement (Randolph, 2021) to this review has a list of each article considered for inclusion, which inclusion criteria were met by each study, and notes on selected studies. Six studies were irretrievable as shown in the sheet labeled as irretrievable in the included/excluded studies data set in the supplemental online information.

| Risk of bias in included studies
Overall, the risk of bias for the six randomized studies (Figure 2) was considered to be low. Similarly, the risk of bias for the 26 nonrandomized studies was low (Figure 3). Although it is typical for nonrandomized studies to have overall risk-of-bias ratings of some concerns or high risk of bias, we believe that our nonrandomized studies typically were at low risk of bias because of the strict inclusion criteria we set. For example, nonrandomized studies were excluded if there was not strong evidence for baseline equivalency, which addresses the domains of confounding and selection of participants in the ROBINS-I tool (Sterne, 2016).

| Allocation (selection bias)
Selection bias was deemed to be low in the 32 included studies.
For randomized studies, three of the six studies were rated as having low risk for the randomization process; the exceptions were Jones (1979), Miller (1983), and Miller (1984), which were rated as having "some concerns." For the nonrandomized studies,

| Blinding (performance bias and detection bias)
Because of the nature of the intervention, it was not possible to blind participants to whether they were receiving Montessori or traditional education. Therefore, we did not assess the risk of bias in this domain.

| Incomplete outcome data (attrition bias)
All 32 included studies were rated as low risk in terms of attrition bias.

| Selective reporting (reporting bias)
All 32 included studies were rated as low risk in terms of selective reporting bias.

| Other potential sources of bias
Information on other potential sources of bias not listed above can be found in Figures 2 and 3. In short, we deemed there to be low risk of bias from other potential sources.

| Academic outcomes
Main effects for academic outcomes
Table 4 and Figure 4 (a histogram of raw effect sizes for academic outcomes) summarize the main effects of Montessori education versus traditional education for academic outcomes. For readers unfamiliar with the interpretation of meta-analytic main-effects tables, we describe the interpretation of each column in Table 4 here before discussing the specific results in the following paragraphs. The first column in Table 4 indicates the outcome. The second column indicates the standardized mean difference effect size, Hedges' g, which is a sample-size-corrected version of Cohen's d.  should not be interpreted as being reliable when the df is less than 4.00. The fifth column, I², is a measure of study heterogeneity (the difference in effect sizes among studies) expressed as a percentage.
Finally, the fifth and sixth columns of Table 4 indicate the number of studies and the number of effect sizes, respectively, that contributed to each outcome.
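To make the relationship between Cohen's d and Hedges' g concrete, the following is a minimal sketch of the textbook computation for two independent groups. It is not the authors' code, the review used several estimation methods, and the function name and the input numbers are hypothetical.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (g, var_g) for two independent groups (textbook formulas)."""
    df = n_t + n_c - 2
    # Pooled standard deviation across the treatment and comparison groups.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_t - mean_c) / sd_pooled  # Cohen's d
    j = 1 - 3 / (4 * df - 1)           # small-sample correction factor
    # Common large-sample approximation to the variance of d.
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return j * d, j ** 2 * var_d


# Hypothetical example: 20 Montessori and 20 traditional students.
g, var_g = hedges_g(105, 100, 10, 10, 20, 20)
print(round(g, 3), round(var_g, 3))
```

With these hypothetical inputs Cohen's d is 0.50 and the correction shrinks it slightly; the shrinkage becomes negligible as sample sizes grow.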
As Table 4  In terms of the individual academic outcomes presented in Table 4, all academic outcomes were in favor of Montessori education. In summary, there was moderate quality of evidence that   (Higgins, 2021). In the following section, we report the results of a moderator analysis for all academic outcomes combined for the following attributes: random vs.
nonrandom assignment, grade level, Montessori setting (public vs. private), treatment duration, and length of follow up. We were unable to reliably conduct a moderator and sensitivity analysis for each individual outcome because there was an insufficient number of effect sizes to do so.
Moderator analyses are typically conducted via a technique called meta-regression (Higgins, 2021). An example of a meta-regression results table comparing the effect sizes of studies with nonrandom assignment to those of studies with random assignment is presented in Table 5.
For readers who lack familiarity with results from cluster-robust meta-regression, we provide some guidance in the paragraphs below.
The first column of Table 5 shows the various levels or categories of the attribute that are being investigated and the number of effect sizes included for that category. Here we are investigating the attribute of assignment (i.e., random vs. nonrandom assignment). There were 72 effect sizes from studies with nonrandom assignment and 41 effect sizes from studies with random assignment. The row category denoted with a superscript a is the reference category.
The regression coefficient in the second column of The fourth column in Table 5 (df) shows the degrees of freedom for each coefficient. According to Tipton (2015), coefficients with a df less than 4 may not be reliable and should be interpreted with caution. Finally, the last column, p, gives the p value for the test of the null hypothesis that the population coefficient in the second column is zero. In short, the Hedges' g value for the reference category, which will always be the first-row category in the meta-regression tables presented here, is the effect size for that reference category. The value in the Hedges' g column for any category besides the reference category is the mean effect size difference between that category and the reference category.
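The reading of the reference-category row and the difference rows can be illustrated with a small sketch. It assumes a single 0/1 moderator, in which case a weighted regression reduces to two weighted group means. It covers only the point estimates, not the cluster-robust standard errors, and the effect sizes and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of a list of effect sizes."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def dummy_meta_regression(effects, weights, is_comparison):
    """Return (reference_mean, difference) for a single 0/1 moderator.

    With one dummy-coded predictor, the weighted least-squares intercept is
    the weighted mean of the reference category, and the slope is the
    difference between the comparison mean and the reference mean.
    """
    ref = [(g, w) for g, w, flag in zip(effects, weights, is_comparison) if not flag]
    comp = [(g, w) for g, w, flag in zip(effects, weights, is_comparison) if flag]
    reference_mean = weighted_mean([g for g, _ in ref], [w for _, w in ref])
    comparison_mean = weighted_mean([g for g, _ in comp], [w for _, w in comp])
    return reference_mean, comparison_mean - reference_mean


# Hypothetical effect sizes: first two nonrandom (reference), last two random.
intercept, difference = dummy_meta_regression(
    [0.2, 0.4, 0.5, 0.7], [1, 1, 1, 1], [False, False, True, True]
)
print(round(intercept, 2), round(difference, 2), round(intercept + difference, 2))
```

The mean effect size for the non-reference category is recovered by adding its coefficient to the reference-category value.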
Finally, in terms of the substantive interpretation of Grade level. While both within-study and between-study estimates indicate a slightly positive linear relationship between effect size and duration in Montessori education, the wide 95% CIs, the dfs less than four, the nonlinearity, and nonconstant variance of the relationship between effect size and duration lead us to conclude that there is too much uncertainty and ambiguity to interpret these treatment duration results with any meaningful degree of certainty. In short, the results for this moderator were inconclusive.
Four studies included in this review used follow-up measurements of academic outcomes after the Montessori intervention period had ended. We present the results of that analysis in Table 9

Sensitivity analysis for academic outcomes
A sensitivity analysis examines the degree to which methodological or analytical decisions covary with outcomes (Higgins, 2021). In this review, we conducted several types of sensitivity analyses, the results of which would be too voluminous to detail here in their entirety. Therefore, we summarize some of the results of the sensitivity analysis narratively in this paragraph and concentrate on one particularly important sensitivity analysis in the remainder of this section.
In terms of the ρ parameter (i.e., the estimated correlation between dependent effect sizes) used in cluster-robust effect size estimation, the results were consistent regardless of the value of ρ we chose. We conducted leave-one-out analyses for applicable outcomes and also found that the leave-one-out results were consistent with what is reported here.
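The logic of a leave-one-out analysis can be sketched as follows. This is an illustration only: it assumes one effect size per study and uses a simple inverse-variance weighted mean as a stand-in for the cluster-robust model, and the data are hypothetical.

```python
def pooled_estimate(effects, variances):
    """Inverse-variance weighted mean effect size."""
    weights = [1.0 / v for v in variances]
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, variances):
    """Re-estimate the pooled effect with each study removed in turn."""
    results = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_variances = variances[:i] + variances[i + 1:]
        results.append(pooled_estimate(kept_effects, kept_variances))
    return results


# Hypothetical data: three studies with equal precision.
effects = [0.1, 0.2, 0.3]
variances = [1.0, 1.0, 1.0]
for dropped, estimate in enumerate(leave_one_out(effects, variances)):
    print(f"without study {dropped}: {estimate:.2f}")
```

If no single omission moves the pooled estimate much, the summary result does not hinge on any one study.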
For main effects analyses, we compared random-effects and cluster-robust synthesis methods. The overall results were consistent in terms of point estimates of effect size; however, as expected, the variance estimates differed slightly between random-effects and cluster-robust models. The random-effects models tended to have lower variance than the cluster-robust models, but we used cluster-robust models nonetheless because we believed that the benefits of taking into account the dependencies between effect sizes outweighed slightly more accurate variance estimates, as discussed in Tanner. Fixed-effect models had smaller 95% CIs than cluster-robust or random-effects models, as expected, and were less favorable to Montessori education. We attribute this to the fact that fixed-effect models tend to weight studies with large samples heavily compared to other models and that fixed-effect models do not take into account studies with multiple effect sizes (Higgins, 2021). We believe that Culclasure (2018) was overweighted in fixed-effect models because it contributed many effect sizes and had sample sizes in the thousands. As we discuss in the Discussion section, Culclasure (2018), as well as Ansari (2014), lacked consistently high treatment fidelity in the implementation of Montessori education and, thus, these studies might suppress the effect size, favoring traditional education.
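The weighting argument above can be illustrated with a short sketch that contrasts fixed-effect and random-effects weights. It assumes the DerSimonian-Laird estimator of between-study variance, which is one common choice and not necessarily the one used in this review. The four effect sizes and variances are hypothetical, with the first standing in for a very large study with a small effect.

```python
def pooled(effects, weights):
    """Weighted mean effect size."""
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)


def dl_tau_squared(effects, variances):
    """DerSimonian-Laird estimate of the between-study variance."""
    w = [1.0 / v for v in variances]
    mean_fe = pooled(effects, w)
    q = sum(wi * (g - mean_fe) ** 2 for wi, g in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)


# Hypothetical data: one very precise study with a small effect, three small studies.
effects = [0.05, 0.40, 0.50, 0.30]
variances = [0.001, 0.05, 0.05, 0.05]

fe_weights = [1.0 / v for v in variances]
tau2 = dl_tau_squared(effects, variances)
re_weights = [1.0 / (v + tau2) for v in variances]

share_fe = fe_weights[0] / sum(fe_weights)  # weight share of the large study
share_re = re_weights[0] / sum(re_weights)
print(f"large-study share: fixed {share_fe:.2f}, random {share_re:.2f}")
print(f"pooled g: fixed {pooled(effects, fe_weights):.2f}, "
      f"random {pooled(effects, re_weights):.2f}")
```

In this hypothetical case the large study carries most of the fixed-effect weight, so the fixed-effect estimate sits close to that study's small effect, while the random-effects estimate is pulled toward the smaller studies.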
One sensitivity analysis of particular importance was related to our method of effect size estimation. We used either a method for standardized mean differences, a standardized mean raw score method, or an "other" method where we calculated an effect size using
Finally, for those interested in null hypothesis testing, we calculated the Benjamini-Hochberg (Benjamini, 1995) adjusted α to help readers account for the multiplicity of statistical tests (see Wasserstein, 2019), at least for main effects. A list of those adjustments can be found in the online supplement (Randolph, 2021).
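For readers unfamiliar with the Benjamini-Hochberg procedure, the following sketch shows how the rank-specific thresholds are obtained and applied. It is a generic illustration with hypothetical p values, not the adjustment list from the online supplement.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (thresholds, rejected), both in the original order of p_values.

    The p value with rank i (1 = smallest) out of m tests is compared with
    (i / m) * alpha. All hypotheses up to the largest rank that meets its
    threshold are rejected (the step-up rule).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    thresholds = [0.0] * m
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        thresholds[idx] = rank / m * alpha
        if p_values[idx] <= thresholds[idx]:
            largest_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= largest_rank
    return thresholds, rejected


# Hypothetical p values from four main-effect tests.
thresholds, rejected = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
print(thresholds)
print(rejected)
```

In this hypothetical set only the smallest p value falls below its threshold, so only that test would be declared significant at a false discovery rate of 0.05.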

Possible publication bias for academic outcomes
Publication bias is a type of systematic bias that occurs when studies with null or negative effect sizes are withheld from the research record, either because of authors not submitting null or negative
FIGURE 11 Scatterplot of treatment duration and observed effect size for academic outcomes. Positive effect sizes favor Montessori education over traditional education.
TABLE 9 Cluster-robust meta-regression for follow-up years for academic effect sizes (between).
Figure 12). In these funnel plots, the vertical axis represents some measure of sample-size-related variability (inverse sample size, raw sample size, square root of the sample size, log sample size) and the horizontal axis represents the effect size with positive effect sizes favoring Montessori education. Each data point within each funnel represents the correspondence of the standard error and effect size for each effect size. In general, larger studies will have less variance and, therefore, will be located near the top of the funnel whereas small studies will be located near the bottom.
Deviations from a funnel shape may be potential indicators of publication bias. See Higgins (2021)   explained through an analysis of study characteristics. The supplemental information (Randolph, 2021) contains funnel plots for individual academic outcomes.
In addition to a visual examination of funnel plots, we also conducted a trim and fill analysis. Using Duval's (2000) trim and fill algorithm enabled us to determine the number of effect sizes that may have been missing due to publication bias, impute the missing effect sizes to minimize funnel plot asymmetry, and then estimate a new effect size with the imputed effect sizes included.
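The three steps just described (estimate the number of missing effect sizes, impute them, re-estimate) can be sketched in simplified form. The sketch assumes effect sizes are missing on the left side, uses a fixed-effect weighted mean and a rank-based estimator of the number of missing effect sizes, and does not handle tied ranks. The review itself used random-effects estimates for this comparison, so this is an illustration of the idea on hypothetical data rather than a reproduction of the analysis.

```python
def pooled_effect(effects, variances):
    """Inverse-variance (fixed-effect) weighted mean."""
    weights = [1.0 / v for v in variances]
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)


def trim_and_fill_left(effects, variances, max_iter=50):
    """Return (n_imputed, unadjusted_mean, imputed_mean)."""
    n = len(effects)
    order = sorted(range(n), key=lambda i: effects[i])
    y = [effects[i] for i in order]    # ascending, so the largest effects come last
    v = [variances[i] for i in order]
    k0 = 0
    for _ in range(max_iter):
        # Trim the k0 largest effects and re-estimate the center.
        center = pooled_effect(y[: n - k0], v[: n - k0])
        deviations = [yi - center for yi in y]
        # Rank absolute deviations (1 = smallest) and sum the ranks on the right.
        by_size = sorted(range(n), key=lambda i: abs(deviations[i]))
        rank = {idx: r for r, idx in enumerate(by_size, start=1)}
        t_n = sum(rank[i] for i in range(n) if deviations[i] > 0)
        estimate = (4 * t_n - n * (n + 1)) / (2 * n - 1)
        new_k0 = min(max(0, round(estimate)), n - 2)
        if new_k0 == k0:
            break
        k0 = new_k0
    center = pooled_effect(y[: n - k0], v[: n - k0])
    # Fill: mirror the trimmed effects around the center, keeping their variances.
    mirrored = [2 * center - yi for yi in y[n - k0:]]
    imputed_mean = pooled_effect(y + mirrored, v + v[n - k0:])
    return k0, pooled_effect(y, v), imputed_mean


# Hypothetical data: small, imprecise studies report the largest effects.
effects = [0.10, 0.15, 0.20, 0.30, 0.45, 0.70, 0.90]
variances = [0.01, 0.01, 0.01, 0.02, 0.04, 0.08, 0.10]
k0, unadjusted, imputed = trim_and_fill_left(effects, variances)
print(k0, round(unadjusted, 3), round(imputed, 3))
```

The quantity of interest is the gap between the unadjusted and imputed estimates, which serves as a rough gauge of how much asymmetry could be shifting the summary effect.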
We provide an example of that procedure here for the language/literacy outcome. Figure 13  Duval (2000) notes that the trim and fill effect size should not be interpreted as a publication-bias-adjusted effect size. Instead, they recommend that one should use the difference between the imputed and unimputed effect sizes to gauge the degree of potential bias for a given outcome. For example, the 0.11 standard deviation difference between the imputed and unimputed effect size estimates is a rough estimate of the degree to which publication bias may be present in the Montessori research regarding language and literacy. In summary, we found strong evidence for a publication bias effect on the language/literacy outcome. We estimate that the publication bias effect for language/literacy is on the order of about 1/10th of a standard deviation in a way that is systematically biased in favor of Montessori education. However, we believe that the degree of systematic bias in favor of Montessori education is not sufficient to negate the overall finding that Montessori education improves language/literacy outcomes when compared to traditional education.

| Nonacademic outcomes
Main effects for nonacademic outcomes
See Figure 15 for a graphical display of study heterogeneity and effect sizes of nonacademic outcomes from a bootstrap analysis of 1,000,000 samples of our nonacademic effect sizes. This plot shows that the plausible values of the summary effect size for nonacademic outcomes range approximately from 0.10 to 0.35 with high study heterogeneity.
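The idea behind such a display can be sketched in a few lines: a summary effect size and I² are computed for many subsets of the effect sizes, and each subset contributes one point to the plot. The sketch below enumerates every subset, which is feasible only for a handful of effect sizes, whereas the analysis described above drew 1,000,000 samples. It also uses a fixed-effect weighted mean as a simplification, and the data are hypothetical.

```python
from itertools import combinations


def pooled_and_i2(effects, variances):
    """Return (weighted mean, I-squared in percent) for one subset."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * g for wi, g in zip(w, effects)) / sum(w)
    q = sum(wi * (g - mean) ** 2 for wi, g in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return mean, i2


def gosh_points(effects, variances, min_size=2):
    """Compute (mean, I-squared) for every subset with at least min_size effects."""
    k = len(effects)
    points = []
    for size in range(min_size, k + 1):
        for subset in combinations(range(k), size):
            points.append(
                pooled_and_i2([effects[i] for i in subset],
                              [variances[i] for i in subset])
            )
    return points


# Hypothetical data: three effect sizes with equal precision.
points = gosh_points([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
for mean, i2 in points:
    print(f"g = {mean:.2f}, I2 = {i2:.0f}%")
```

Plotting the resulting pairs, with the summary effect on the horizontal axis and I² on the vertical axis, shows the range of summary values that different combinations of studies would support.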
In terms of the four individual nonacademic outcomes shown in Table 12, there was moderate quality evidence that Montessori There was also evidence of high study heterogeneity for nonacademic outcomes as demonstrated by the high values of I² in   Duval's (2000) trim and fill algorithm to bring symmetry to the funnel plot. Without imputation, Hedges' g was 0.24 compared to 0.13 when effect sizes were imputed.

Moderator analysis for nonacademic outcomes
TABLE 11 Academic effect sizes with and without left-side trim and fill imputation. Note: Outcomes with an NA indicate that there was enough symmetry that the trim and fill algorithm did not estimate any effect sizes to be missing and, therefore, no effect sizes were imputed. Random-effects effect size estimates were used for imputed and nonimputed estimates to facilitate comparison; therefore, these effect size estimates may differ from the cluster-robust estimates presented elsewhere. Social studies and science were not included in this analysis because they had fewer than 10 effect sizes. L0 means left-side imputation.
Abbreviation: CI, confidence interval.
Figure 20 show the results of an examination of nonacademic effect sizes as a function of treatment duration. The within-study and between-study effects were contradictory; the within-study effect showed a very slight treatment-duration effect in favor of Montessori education (Hedges' g = 0.001, 95% CI [−0.089, 0.091]) and the between-study effect showed the same treatment-duration effect but in favor of traditional education (Hedges' g = −0.004, 95% CI [−0.011, 0.003]) for nonacademic outcomes. For the same reasons explained in the section on treatment duration for academic outcomes, we regard these results as inconclusive. There were too few studies to do a moderator analysis of follow-up studies for nonacademic outcomes.

Sensitivity analysis for nonacademic outcomes
The sensitivity analysis for nonacademic outcomes yielded nearly the same results as the one for academic outcomes: the difference in results between analytical methods was unremarkable.
In summary, the results were consistent regardless of the value of the parameter for ρ that was chosen. The leave-one-out results differed only marginally from the effect sizes reported when all studies were included. For main effects analyses, when comparing random-effects and cluster-robust synthesis methods, the overall results were consistent in terms of point estimates of effect size; however, as expected, the variance estimates differed slightly between random-effects and cluster-robust models. Although not reported here, the fixed-effect models had smaller 95% CIs and were less favorable to Montessori education, likely due to implementation in the large sample studies that will be discussed later.
In terms of the differences between effect sizes rendered using different estimation methods for nonacademic outcomes (see Table 17), there was a large effect size difference between the standardized mean raw score change method and the other method  Table 13. However, the mean difference between the two standardized methods, which consisted of the majority of effect sizes, was only 0.06 standard deviations and the Wald test for omnibus differences yielded a p value of 0.382, F(2, 1.57) = 1.57. This suggests we are justified in synthesizing results regardless of the method we used to calculate a study's effect size.
FIGURE 14 Kernel density plot of unweighted values of Hedges' g for individual nonacademic outcomes.
FIGURE 15 Graphical display of study heterogeneity for all nonacademic outcomes combined. In the GOSH plot above, the X-axis (Overall Estimate) shows the Hedges' g effect size. Positive effect sizes favor Montessori education over traditional education. The Y-axis represents a measure of study heterogeneity, I², where higher values represent more study heterogeneity. The black data points are simulated values of the population effect size and heterogeneity.
Possible publication bias for nonacademic outcomes
Figure 21 is a four-panel funnel plot of all nonacademic outcomes combined. The same asymmetry of small studies in favor of Montessori education that was seen for academic outcomes is also seen here for nonacademic outcomes. Similarly, we believe that some of the variance can be explained through an analysis of study characteristics of the extreme outlying studies and of the studies with a very large sample size, as explained in the Discussion. We conclude that there is some slight asymmetry in small studies in favor of Montessori education but it is unclear whether publication bias is the cause. Asymmetry alone is not sufficient evidence for publication bias (Higgins, 2021). Overall, because of the small degree of asymmetry and the analytical methods used, we conclude that the asymmetry does not change the substantive conclusion that Montessori education is favorable for most nonacademic outcomes. Conservatively, we believe that the range of plausible values of the aggregate effect of Montessori education on nonacademic outcomes is represented well in the GOSH plot shown in Figure 15.
(Popper, 1962/2014). When the evidence is mixed, as was concluded by two recent narrative reviews of Montessori outcomes (Ackerman, 2019; Marshall, 2017), meta-analyses can resolve the ambiguity by providing a more precise effect size estimate based on a large set of observations drawn from multiple studies and samples (Rosenthal, 2002). The Campbell reviews reflect this view, and the
Thirty-two studies that were conducted before February 2020, when data collection closed, did meet the criteria. They examined a broad array of outcomes, from academic ones like mathematics and literacy to social-emotional outcomes like creativity and social skills. Because single studies contributed several effects, most analyses used cluster-robust regression.

FIGURE 19 Forest plot for social skills and behavior. Positive effect sizes favor Montessori education over traditional education.
TABLE 13 Cluster-robust meta-regression for random assignment for nonacademic effect sizes.
Columns: Effect (n of sizes), Hedges' g, 95% CI of g, df, p.
The studies included public and private schools, and participants ranging from preschool through high school; subsequent analyses examined these and random/nonrandom assignment as moderators.
In contrast to the aforementioned recent narrative reviews that were neutral on the efficacy of the Montessori approach, this quantitative meta-analysis showed that Montessori education had largely positive effects on outcomes, with effect sizes that ranged from small to large in the context of school-based field research with standardized (noncustom) measures (Kraft, 2020). Below we summarize these findings, separating academic from nonacademic outcomes. Note that we believe our findings might have been unduly influenced (in a direction that is unfavorable to Montessori) by a single study with a very large N that also contributed many effects; this study included schools with varying quality of Montessori implementation as assessed by that study's own fidelity of implementation measurement. This is discussed later.
TABLE 15 Cluster-robust meta-regression for public versus private Montessori setting for nonacademic effect sizes.

| Summary of main results
Here we summarize the academic and then the nonacademic findings of our analysis. The outcomes reported here refer generally to all students, not to particular subgroups such as lower-income children.
In addition, we note that several of the included studies have small Ns, and studies with small Ns typically need larger effect sizes to be published. However, our bias analyses revealed relatively little evidence of publication bias, rendering this less of a concern.
FIGURE 21 Four sample-size based funnel plot variations for nonacademic outcomes.
TABLE 18 Nonacademic effect sizes with and without left-side trim and fill imputation. Note: Outcomes with an NA indicate that there was enough symmetry that the trim and fill algorithm did not estimate any effect sizes to be missing and, therefore, no effect sizes were imputed. Random-effects effect size estimates were used for imputed and nonimputed estimates to facilitate comparison; therefore, these effect size estimates may differ from the cluster-robust estimates presented elsewhere. L0 means left-side imputation.

| Academic outcomes
Both overall and by individual outcome, the meta-analysis showed that Montessori's average effect on academic outcomes was uniformly positive, with Hedges' gs ranging from 0.26 (general academic ability, including composites) to 0.05 (social studies). Where math was reported separately, the effect size was 0.22 (based on 36 effects from 12 studies), and for literacy, it was 0.17 (based on 45 effects from 16 studies). For research on school programs (which 28 of the 32 included studies concerned), these are noteworthy effect sizes. In a discussion of effect sizes for school studies involving standardized test scores (most of the academic tests included here), the math finding would be considered large and the literacy effect size medium-large (Kraft, 2020). One recent meta-analysis using school lotteries found the effect of charter school attendance to be much smaller than the effects observed here (Cheng, 2017); specifically, it found average effects on ELA (English Language Arts) of about one-tenth of a standard deviation (0.09), and on mathematics of about two-tenths of a standard deviation (0.19).
Results of charter schools using the "no excuses" model specifically were comparable to those achieved by Montessori (0.25 and 0.17).
We note that the charter school analyses typically were of older grades, and effect sizes typically decline across school years (Hill, 2008). The Montessori effect sizes for science and social studies were considerably smaller, and the lower end of their confidence intervals was negative, but these analyses included only three and one studies (respectively) and just five and three effect sizes; thus, we have less confidence in those findings.
The academic results are important because they resolve existing ambiguity regarding Montessori's effects. While many studies of Montessori report positive outcomes, the domains attaining significance are not entirely consistent across studies, and there is at least one relatively highly cited published study with one negative academic effect (Lopata, 2005), which did not meet the inclusion criteria for this review. In such circumstances, meta-analytic results are very important (Rosenthal, 2002).
Montessori is a constructivist educational approach (Elkind, 2003) and is sometimes equated with discovery learning approaches because children have considerable independence, being free to choose what to work on each day (Lillard, 2017). Discovery learning approaches typically do not have positive academic outcomes when compared with more traditional direct instruction (Klahr, 2004; Mayer, 2004). Studies of Montessori outcomes, like those included in this meta-analysis, most often use traditional direct instruction as the comparison, although some use a more specific counterfactual (like HighScope in the Ansari, 2014 study) and others used conventional U.S. preschools from an earlier era in which preschool was mostly free play. The results of this meta-analysis indicate that on average, across studies, academic outcomes in Montessori are better than those of traditional education, however it was defined in a given study.
The implementation range was broad, as described in Types of interventions, reflecting the broad range of programs dubbed "Montessori" in the real world; the same is true for the traditional school implementations. Thus, this study likely reflects the real-world difference between Montessori and traditional school.
The effect sizes are similar to those rendered in a meta-analysis of oversubscribed "no excuses" charter schools, in which the counterfactual is typically underperforming urban public schools.

| Nonacademic outcomes
The average effect size was slightly stronger for nonacademic outcomes (0.33) than academic ones ( However, it may provide outsized experiences in exercising inhibitory control which is an aspect of executive function (see below). Because executive function is a significant predictor of both concurrent (Jacob, 2015) and future outcomes including health, wealth, and criminality (Moffitt, 2011), this Montessori result is highly significant.
There are several potential routes by which Montessori might impact the development of executive function (Lillard, 2017). For example, it has many parallels to mindfulness training, such as emphasizing and cultivating concentrated attention, educating the senses to notice fine gradations in stimuli, taking great care in every movement of the body, and sometimes sitting or walking for a period of time in purposeful silence. Mindfulness training appears to influence executive function through an impact on sustained attention (Leyland, 2019; Zoogman, 2015), and was observed in a meta-analysis to influence social-emotional outcomes more generally (Maynard, 2017). Another, simpler route to inhibitory control specifically is that children have to wait in Montessori because there is only one of most resources. Because there is one copy of each material, if another child is using a material, other children who want to use that material must wait. In addition, properly implemented Montessori has only one teacher (Lillard, 2019a; Lillard, 2019b).
Montessori also had a nontrivial impact on one's inner experience of school, which translates to well-being at school (Hedges' g = 0.41); although this result stems from just 10 effects and five studies and has lower evidence quality, its size makes it likely that there is some impact. It makes theoretical sense that Montessori would lead to higher well-being. For example, self-determination is associated with higher well-being (Ryan, 2000), and in Montessori environments, relative to conventional school environments, children are given considerable freedom as long as they use that freedom to constructive ends for their own and others' development. People who attended Montessori as children have higher adult well-being and recall liking school better during childhood (LeBoeuf, 2022).
Another nonacademic outcome that is related to Montessori was creativity, with a Hedges' g of 0.26; the 95% CI around this effect ranged from −0.21 to 0.74, so the result, to which six studies contributed 24 effect sizes, should be interpreted cautiously. It is conceivable, but not clear from this evidence, that the freedom to consider possibilities (Rinke, 2013), combined with a lack of extrinsic rewards and multiple-choice tests (see Lillard, 2017), fosters creativity in Montessori. Another effect that is less clear is social skills, with a Hedges' g of 0.23 and a 95% CI from −0.02 to 0.49, from nine studies and 23 effect sizes. It is possible that the multi-aged classrooms and sustained relationships with peers due to looping enable advances in social skills; alternately, advanced social skills could be a by-product of increased executive function. Again, we have less confidence in the social skills and creativity effects.
In sum,
• Montessori education yielded strong and clear effects on math, literacy, general academic ability, and executive function,
• Montessori education's effects on aspects of well-being, such as the inner experience of school and school liking, also were strong and appeared reliable, and
• Montessori education also appeared to affect social studies, science, creativity, and social skills, but these effects are less clear and need further study.

| Potential moderators
study reported an implementation score of 6.5/10), but it did provide for free choice and other philosophical elements, and this early experience at four might have led to long-term outcome differences relative to the counterfactual (which was a traditional free play school). Another study of long-term academic outcomes (not included here because it lacked appropriate controls) also found better academic performance for Montessori students (with unclear program implementation) even years after they had left the program (Dohrmann, 2007). Other studies suggest that without regard to implementation, Montessori predicts better nonacademic outcomes (LeBoeuf, 2022; (Tuttle, 2012; Weiland, 2020), here the oversubscribed schools were all AMI (Association Montessori Internationale) recognized Montessori schools. AMI schools adhere to the structural elements (i.e., they hire AMI-trained teachers, which is an intensive 9-month, standardized training with a highly prepared and vetted trainer; they have the specific 3-year age spans, scarce adults and high ratios; and the long work periods and full set of Montessori materials) and they also implement the philosophical elements well. By contrast, teachers trained by the other major training organization, the American Montessori Society (AMS), founded to "Americanize" Montessori education (Rambusch, 1992), are relatively more supportive of conventional American education practices like tests, due dates, worksheets, and whole-class activities (Daoust, 2018, April), which might dilute the immediate effects of the Montessori program as compared to traditional education. Two studies specifically examined the influence of implementation on outcomes and found that the more classic implementation espoused by AMI is associated with better outcomes than supplemented implementations (Lillard, 2012).
In sum, we speculate that the randomized trials had better outcomes in this meta-analysis because sleeper effects on outcomes occur in response to philosophical aspects of Montessori that exist even when structural implementation is weak, and because two of the random studies, while looking at immediate outcomes, had especially high fidelity Montessori implementation.
To shed further light on why random assignment had an unexpectedly stronger effect than nonrandom assignment in this meta-analysis, more research is needed using random assignment while also coding for Montessori implementation. We would hypothesize that among studies using random assignment, those using a more classic Montessori implementation would have stronger effects.

Age level
Academic effects were strongest for children in Elementary school, at 0.36. As is typical, effects were smaller across middle and high school; effect sizes for academic achievement across a school year decline fairly steadily from K to 12 (Hill, 2008). However, for Montessori, effects at preschool (0.20) were also smaller than effects at Elementary school (0.36). Yet the preschool effects are still notable. For example, one random-assignment study of Head Start, with children enrolled either at age three or age four, found that 13 of 22 effects on language, literacy, and math were significant, and of those that achieved significance, the average effect size was 0.18 SD (Barnett, 2011), similar to our preschool effect, which was not limited to those that achieved significance. The greatest gains for most children are typically seen from Kindergarten to first grade (Hill, 2008) when children undergo the famous "5 to 7 shift" (Sameroff, 1996) with biological changes augmenting environmentally-induced ones. In Elementary school, the average growth across each school year is 0.44 (Lipsey, 2012); thus, Montessori education may add as much as 80% of the gains expected in an entire school year to children's achievement. Although effect sizes were calculated for middle and high school and were considerably smaller, very few studies contributed to effect sizes at those levels, giving us less confidence in those effects. Nonacademic effects were strongest for young children, which may indicate that at those ages children's executive function, creativity, and so on are most malleable.
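The "80% of a school year" figure follows from dividing the Elementary effect size by the average annual growth cited above. A minimal sketch of that arithmetic is shown here; the function name is illustrative only.

```python
def share_of_annual_growth(effect_size, annual_growth):
    """Express an effect size as a fraction of one school year's average growth."""
    return effect_size / annual_growth


# Elementary effect size (0.36) relative to average annual growth (0.44).
share = share_of_annual_growth(0.36, 0.44)
print(f"{share:.0%}")
```

The ratio is roughly 0.82, which the text rounds to 80%.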
There are few developmental studies of these abilities, thus we were unable to view these results comparatively.

Public versus private Montessori
As  (Lipsey, 2012 Montessori education is likely to achieve academic results as good as or somewhat better than public Montessori education.
As In summary, we conclude that there is preliminary evidence that private Montessori education leads to moderately greater performance in both academic and nonacademic outcomes than public Montessori education. We can also conclude that there is preliminary evidence that this difference in improvement for private Montessori and public Montessori is likely to be greater for nonacademic outcomes than academic outcomes. We argue that the evidence is preliminary because of the statistical imprecision of our estimates.
The public-private difference might be due to implementation.
Public schools are required to have children take state exams that are designed for traditional school programs; Montessori schools need to adjust their program to prepare for those tests, and this by necessity dilutes Montessori implementation at public schools. Public schools are also required to follow regulations about recess breaks, special classes in art and sports, add-in curricula, and teacher-child ratios and class sizes that compromise the integrity of the Montessori program.
Private schools also sometimes make such compromises, but they are more free to determine their programs.

| Overall completeness and applicability of evidence
Our searches included a variety of academic databases, sources known to publish gray literature, Montessori-related journals, and manual searches of references in retrieved studies. This search led to over 1500 unique records and, of those, 173 were included for full-text review. Of those, 32 studies met the criteria for inclusion.
The exhaustiveness of our search procedure and the number of records found lead us to believe that the Montessori-related research listed in the excluded or included studies sections here represents a nearly complete, or at least, representative set of the quantitative Montessori research literature published before the specified search period, which ended February 2020. The outcomes also represent a broad variety of academic and nonacademic outcomes.
Also, because of the thoroughness of the search method, we conclude that the results presented here are applicable over a wide variety of traditional education and Montessori settings. However, when deciding the degree to which these results might generalize to particular settings or to future studies, consider these notes:
• The majority of studies were conducted in elementary or pre-K settings.
• North American studies were predominant, but there was some degree of international representation from Europe, the Middle East, and Asia.
• We only included studies that met strict requirements for demonstrating baseline equivalency; we suspect that including studies that met less stringent baseline-equivalency requirements would have led to somewhat different effect sizes.
In summary, we are confident that the results presented here are drawn from a complete or nearly complete set of studies published during the search period and that the results are applicable over a wide variety of traditional and Montessori settings, to the most common academic and nonacademic outcomes, and across multiple measures of the given construct.

| Quality of the evidence
The results of the risk of bias assessments indicated that the risk of bias in the included studies was low overall. Nonexperimental studies tend to have moderate or high risk of bias (Higgins, 2021), and we attribute our nonexperimental studies' having low risk of bias to our stringent inclusion criteria.
When applying the GRADE criteria, the quality of evidence was downgraded because of lack of precision (e.g., as measured by a nonstatistically significant result), high heterogeneity (e.g., a high I² value), and/or asymmetry of funnel plots, which can be an indicator of publication bias. If there were publication bias, we anticipate that correcting for it would have led to marginally smaller effect sizes, since there was a set of studies with small/medium sample sizes and larger effect sizes than would have been expected by chance. We suspect that treatment fidelity may have been compromised in large-sample studies, which tend to have more weight in a meta-analysis than studies with small sample sizes and which tended to contribute more effect sizes than smaller studies. Therefore, we upgraded the quality of evidence for outcomes that included the large-sample studies (Culclasure, 2018 and/or Ansari, 2014) because we believed compromised fidelity was a plausible confounding factor that may have led to underestimation of the treatment effect. We assumed that greater treatment fidelity should result in greater effect sizes in favor of Montessori education over traditional education.
In summary, we conclude that there is moderate to high quality evidence to support our findings for most academic and nonacademic outcomes. See the Summary of Findings table 1 for more-detailed information on the quality of evidence.

| Potential biases in the review process
One limitation that could have biased these results is that we were unable to determine the quality of Montessori implementation in five of the included studies, and in several others we based our estimates on article descriptions that varied in completeness. We have noted that this does reflect the state of Montessori education in the real world. If a study claimed to be a test of Montessori education versus traditional education, and had evidence of equivalent baseline by accounting for key variables or by a lottery design in which children whose parents applied to oversubscribed schools were admitted or not admitted at random, then its findings were included. Some of the studies were done at AMI (Association Montessori Internationale) recognized schools (Denervaud, 2019; Denervaud, 2020; Lillard, 2006; Lillard, 2012; Lillard, 2017; Mix, 2017; Rathunde, 2005a; Rathunde, 2005b; and Yussen, 1980) and another at a school accredited by the Swiss Montessori Association, suggesting high fidelity implementation, but for some studies, implementation quality was known to be of lower quality; for example, both the Ansari (2014) and the three Miller studies (Jones, 1979; Miller, 1983; Miller, 1984) took place at Montessori preschools that had only 4-year-olds, missing the key ingredient of a 3-year cycle and age grouping.
Culclasure (2018) is important to discuss in this regard, because it weighed heavily in our analysis due to its many effects (26) and very large sample (thousands of children contributing to each effect).
Culclasure studied outcome differences for children in 23 public South Carolina Montessori schools. Only schools that passed minimal criteria for being Montessori were included in the study, but even among those included, implementation varied widely. Culclasure had trained Montessori teachers observe in a random selection of 126 classrooms and rate implementation, and they also had teacher surveys. Although half the programs were considered high fidelity by the study's own metrics, the other half were of medium or low fidelity. The teacher survey indicated that 35% of teachers did not think they had all or even most of the materials they needed to teach Montessori. Also, 90% said they used circle time centered around a weekly theme, suggesting they were more teacher-directed than is optimal, and 43% said they supplemented the Montessori program with other materials (which Lillard (2012)  Because implementation varies widely in Montessori schools (Daoust, 2004; Daoust, 2018; Daoust, 2019), and some studies suggest that better outcomes are achieved when implementation is more closely aligned with Montessori principles and practices (Lillard, 2019a; Lillard, 2019b), the lack of attention to implementation, or sufficient reporting of implementation fidelity, in some of the studies included in this review is a limitation. The heterogeneity of implementation could be a reason for the asymmetry observed in some of our funnel plots. We expect that effect sizes would be greater had Montessori interventions been completed with greater fidelity. A future analysis might examine whether effect sizes increase as the level of implementation increases.
The effect sizes for the social studies outcome only came from one study (Culclasure, 2018). Therefore, we urge readers to note this important limitation when considering the result.
We did not extract data on whether control and experimental groups were assigned to the same school or whether the unit of randomization was at the individual, classroom, or school level. We suggest that this information be extracted and considered in future reviews.
Another potential bias is that the comparison samples/school programs are heterogeneous. It was outside of the scope of this review to compare Montessori to any specific alternative program.
Thus, although virtually all the control children were receiving traditional or business-as-usual education, traditional schools vary.
Although some specified that this meant teacher-led, whole-class learning using mostly lectures and textbooks, others did not. Our comparison could be described as average Montessori versus average traditional programs. We are unable to say which specific Montessori implementation is better or worse-only that compared to a range of other choices, on average, it produces positive effects.
Other sources of potential bias concern the categorization of the measures into the outcome categories we chose. Others might categorize some outcomes differently. Categorization was done without consideration of any potential impact on results, but it is possible that different categorization choices would yield significantly different results. Our database and codes are available for interested readers to review our categorization choices and redo the analyses with alterations to examine their impact.
Another potential limitation of the meta-analysis is that many studies of Montessori education, including the ones studied here, had relatively small samples (two exceptions are the studies by Ansari, 2014 and Culclasure, 2018). Effect sizes tend to be larger with small-N studies (Kraft, 2020), and this was evident in the funnel plots for many of the outcomes we presented here. Therefore, it is likely that if more studies had been done with large sample sizes, the effect sizes would have been smaller than those reported here. We believe that the GOSH plots (Figures 5 and 15) robustly display the plausible ranges of population effect size estimates. Furthermore, we have included the code and data set so that researchers can investigate methodological variations that are too cumbersome to report here (e.g., fixed vs. random effects).
There was some evidence (asymmetry in funnel plots) that may point to publication bias in favor of Montessori. The trim and fill estimates in Tables 11 and 18 provide a general estimate of what the population effect sizes might be if data were imputed to make the funnel plot of effect sizes be symmetrical. We encourage Montessori researchers to attempt to publish the results of methodologically sound studies regardless of whether the results are negative, null, or positive and, thereby, help create a more comprehensive research record.
One might be concerned that one of the authors of this meta-analysis (Lillard) also authored three of the included studies. Lillard joined the team after the initial analyses were done, and was not involved in devising the selection criteria, nor in deciding which studies to include, nor in the analyses. Thus, the results were obtained without any opportunity for author bias. Her role was limited to interpretation and writing, summarizing the fidelity of implementation of the interventions, as well as reviewing the broader literature and contextualizing the results.

| Agreements and disagreements with other studies or reviews
Very few reviews of the efficacy of Montessori education have been published in peer-reviewed journals. The results of this meta-analysis of 32 studies are consistent with an earlier meta-analysis that only included two Montessori studies, both unpublished, and focused exclusively on achievement outcomes; it calculated a d of 0.27 (Borman, 2003), similar to our overall academic effect size (Hedges' g = 0.24). Considering qualitative reviews, which were neutral, the results of this analysis reflect more positively on Montessori. Ackerman (2019) concluded that "Montessori programs have the potential to enhance young children's learning and development" (p. 11) but that there was no consistent advantage. Her review had included studies with less rigorous designs than those included here, and lack of rigor in the existing experimental base was a main conclusion of the Marshall (2017) (Whitescarver, 2007). The teacher takes a Rousseauian attitude that given the right conditions (including inspiration and curiosity, which the teacher helps to inspire), children will be kind and happy, and will behave in ways that are constructive for self and society. The teacher believes in, indeed loves, every child and is sure of the Montessori system's capacity to achieve this. Furthermore, in the practice of the AMI, the organization Montessori started to carry on her work, the teacher has gone through intensive training and examination, with a teacher-trainer who themselves had intensive training in an apprenticeship to become a teacher-trainer.
The materials are sets of hundreds of mostly wooden and glass instruments and paper charts designed by Montessori and her collaborators for each age level to convey specific learning; children engage with these materials with their hands or even their full bodies (for further summary, see Lillard, 2019a; Lillard, 2019b). Children are also expected to have had Montessori at the prior level, at least from age three on, because Montessori confers specific learning that is built on at later levels.
Which of these many elements is responsible for better academic and nonacademic outcomes? It may be the wrong question, although there are Montessori programs that eliminate one or more features. If enough programs were found that eliminated a specific feature, one might compare those programs to programs that retain all the standard Montessori features.
For example, the Miller (Jones, 1979;Miller, 1983;Miller, 1984) and Ansari (2014) studies were not fully authentic implementations of Montessori for at least one reason: the classroom had only 4-year-olds.
Some other studies also had limited age groups. The Mallett (2015) and Culclasure (2019) studies were done in public schools where Montessori preK was not necessarily offered. By contrast, some studies state that the school was recognized by AMI (Denervaud, 2019; Denervaud, 2020; Lillard, 2006; Lillard, 2012; Lillard, 2017; Mix, 2017; Rathunde, 2005a; Rathunde, 2005b), each of which suggests the full system was operating, or AMS, which has many elements but a different teacher training and often includes added practices (Daoust, 2018, April

| Implications for research
Further research on Montessori education should attend carefully to implementation; Montessori programs can vary widely, with those recognized by AMI having the strictest implementation, AMS the next most strict, and others sometimes using the name Montessori without implementing the program to any great degree (Daoust, 2004; Daoust, 2018; Daoust, 2019). Although we caution that Montessori is a system such that the whole is likely not the same as the sum of its parts, determining whether and how different implementations influence outcomes is very important.
A second important question for further research not addressed here is examining subsamples such as lower-income children and global majority children. Montessori education, after all, was first designed to serve the needs of lower income students (Montessori, 1964). We extracted some data on these variables for the included studies, but it was outside the scope of this review to do a detailed analysis. We encourage researchers to use our data (Randolph, 2021) to carry out this line of inquiry.
Future research should follow children in Montessori longitudinally. Montessori is most frequently a preschool program and follows a pyramid structure with the fewest programs available at high school.
Most studies of Montessori have been limited to a single data collection point. A few have collected data twice, at pretest and posttest, and very few have followed a sample over several years (Culclasure, 2018; Lillard, 2017) or tested people several years after Montessori schooling was complete (Jones, 1979; Miller, 1983; Miller, 1984; see also Dohrmann (2007), not included here due to lack of baseline equivalency). Recent findings from the Tennessee study of prekindergarten, which showed that initially positive gains for children randomly assigned to preK had reversed by 3rd grade, raise concerns about early schooling in general .  suggested that this could be due to a lack of curricular alignment between preK and later experiences, with Elementary school teachers focusing on students who had not had preK because they had skill deficits; this lack of focus on children who had attended preK then led to those children's later problems. This issue would not be present for children continuing from Montessori preK to Montessori Elementary school programs, but it is very much a concern for children going from Montessori preK to traditional Elementary school.
Another area important for future research involves experimental study design. There is a dearth of randomized trials and high-quality nonexperimental designs that adequately account for baseline differences between treatment and control groups. The use of these designs will reduce the selection threat (Shadish, 2002), which we consider to be the most likely threat in the Montessori research because it is likely that students enrolled in private schools (the setting for most Montessori programs) will differ at baseline on important academic and nonacademic variables. We suggest that if researchers are unable to implement random assignment, they use a baseline variable, such as a pretest, that is a direct measure of the outcome and account for baseline differences statistically by using the baseline variable as a covariate or by the use of gain scores.
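The two baseline adjustments suggested above (using the pretest as a covariate, or analyzing gain scores) can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not an analysis from the review: the function names, the sample size, the baseline gap of 0.5, and the true effect of 0.30 are all assumptions chosen for illustration, and the data are deliberately constructed so that both adjustments are valid.

```python
import numpy as np

def gain_score_effect(pre, post, treated):
    """Difference in mean gain (post - pre) between treated and control groups."""
    gain = post - pre
    return gain[treated == 1].mean() - gain[treated == 0].mean()

def covariate_adjusted_effect(pre, post, treated):
    """Coefficient on the treatment indicator in an OLS fit of post ~ 1 + treated + pre."""
    X = np.column_stack([np.ones_like(pre), treated, pre])
    coef, *_ = np.linalg.lstsq(X, post, rcond=None)
    return coef[1]

# Synthetic (hypothetical) data: the treated group starts 0.5 SD higher at
# baseline, mimicking selection into a program; the true effect is 0.30.
rng = np.random.default_rng(0)
n = 2000
treated = np.repeat([0, 1], n // 2)
pre = rng.normal(0.5 * treated, 1.0)
post = pre + 0.30 * treated + rng.normal(0.0, 0.5, n)

# Unadjusted posttest difference mixes the baseline gap with the effect
naive = post[treated == 1].mean() - post[treated == 0].mean()
adjusted = covariate_adjusted_effect(pre, post, treated)
gain = gain_score_effect(pre, post, treated)
print(naive, adjusted, gain)
```

In this setup the unadjusted posttest difference absorbs the baseline gap, while both adjusted estimates land near the simulated effect. With real data the two adjustments rest on different assumptions and need not agree.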

Finally, in line with suggestions of the American Statistical Association (Wasserstein, 2019), we encourage more meta-analyses of the Montessori research and are making our data publicly available (see Randolph, 2021) to encourage replications and extensions of our review. Permission is universally granted for the use of the inclusion/exclusion data and the data extracted from the included studies.
Some ideas for follow-ups to this review are provided below.
It may be meaningful for future Montessori meta-analyses to be detailed examinations, in which potential moderators are explored, for each of the individual major outcomes with a sufficient number of studies to do so.
It may also be meaningful to use other analytical methods to see if results are consistent across analytical methods. We used a cluster-robust variance estimation procedure that accounts for dependencies in the data and makes accurate point estimates; however, that procedure is less powerful than other synthesis methods and, therefore, will yield variance estimates that are less precise (Tanner-Smith, 2014; Tanner-Smith, 2016; Tipton, 2015).
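To make the idea of a cluster-robust variance estimate concrete, the sketch below pools effect sizes that are nested within studies and computes a sandwich-type standard error that treats each study as a cluster. It is a simplified illustration, not the procedure used in this review: the function name is hypothetical, the between-study variance is assumed to be zero when forming the weights, and no small-sample correction of the kind discussed in Tipton (2015) is applied.

```python
import numpy as np

def rve_mean_effect(g, v, study):
    """Pooled effect and cluster-robust standard error for an intercept-only model.

    g: effect sizes; v: their sampling variances; study: cluster (study) labels.
    Simplifying assumptions (not from the review): between-study variance is
    set to zero and no small-sample correction is applied.
    """
    g = np.asarray(g, dtype=float)
    v = np.asarray(v, dtype=float)
    study = np.asarray(study)
    labels = np.unique(study)

    w = np.empty_like(g)
    for s in labels:
        idx = study == s
        # Correlated-effects style weights: one shared weight per study,
        # divided across that study's k effect sizes
        w[idx] = 1.0 / (idx.sum() * v[idx].mean())

    b = np.sum(w * g) / np.sum(w)
    resid = g - b
    # Sandwich estimator: sum weighted residuals within each study, then square
    meat = sum(np.sum(w[study == s] * resid[study == s]) ** 2 for s in labels)
    se = np.sqrt(meat) / np.sum(w)
    return b, se

# Hypothetical effect sizes from three studies (A contributes 2, C contributes 3)
b, se = rve_mean_effect(
    g=[0.30, 0.25, 0.10, 0.40, 0.35, 0.20],
    v=[0.04, 0.04, 0.02, 0.05, 0.05, 0.05],
    study=["A", "A", "B", "C", "C", "C"],
)
print(b, se)
```

Because residuals are summed within a study before squaring, effect sizes from the same study are not treated as independent pieces of evidence, which is the dependency the text refers to.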
Furthermore, we used a strict methodological criterion in terms of baseline equivalency. It may be useful for authors to use a less strict inclusion criterion and, thereby, be able to include a large number of studies that were excluded because of a lack of strong evidence for baseline equivalency. Our inclusion/exclusion spreadsheet lists the studies that were excluded based on baseline equivalency criteria.
We found evidence of asymmetry in funnel plots, which may be an indicator of publication bias. We suggest that follow-up studies examine the publication bias in further detail. Trim and fill methods like those discussed in Shi (2019) may be useful for imputing left-side missing data (L0) and subsequently estimating the degree of bias resulting from study asymmetry.
Finally, we suggest that future research create a rating scale of Montessori and traditional implementation quality and use that as a potential moderator. We imagine that quality of treatment implementation could be an important variable in explaining the high amount of heterogeneity.

ACKNOWLEDGMENTS
Thanks to Allison Snyder and the research assistants she managed, for their help in pulling together the Characteristics of Included Studies table, among many other helpful tasks. Thanks to the many authors who graciously accommodated our requests for additional data.

CONTRIBUTIONS OF AUTHORS
Justus J. Randolph: Overall methodological design, statistical analysis and programming, data collection and extraction, supervision, grant writing and management, primary text writer of methods and results sections.
Note: Positive standardized mean differences favor Montessori education over traditional education. Because some studies collected data longitudinally in both control and experimental groups and/or used multiple measures of an outcome, we report the total number of observations instead of the number of participants.
a The standardized mean difference effect size reported here is Hedges' g using a cluster-robust method (Tanner-Smith, 2014; Tanner-Smith, 2016; Tipton, 2015).

SOURCES OF SUPPORT
• Wend Collective, USA
The Wend Collective provided partial support of this review through a grant to Angeline Lillard.

DIFFERENCES BETWEEN PROTOCOL AND REVIEW
There were several differences between the protocol and review:
• We made some small changes to the introductory sections because another subject matter expert (Lillard) was added as a co-author after the protocol had been published.
• We updated our overall procedures to comply with the latest version of the MECCIR conduct standards (Campbell Collaboration, 2019a) and MECCIR reporting standards (Campbell Collaboration, 2019b) and the second edition of the Cochrane Handbook for Systematic Reviews (Higgins, 2021).
• We created more specific inclusion and exclusion criteria to help define what qualifies as baseline equivalency.
• Studies excluded because there was insufficient information to compute the relevant effect size were not narratively described because of the large number of studies excluded for this reason.
However, we have provided readers with access to a spreadsheet that shows reasons for exclusion and notes.
• We adopted the ROBINS-I (Sterne, 2016) tool to assess risk of bias in nonrandomized studies.
• We included Benjamini-Hochberg (Benjamini, 1995) corrected α values in an online supplement for those interested in null hypothesis statistical testing.
• Because almost all studies used continuous outcomes, we used a standardized mean difference effect size (specifically Hedges' g) as the sole effect size metric.
• We provided additional information about three different methods of effect size calculation that were used.
• Because multiple methods of effect-size estimation were used, we conducted a meta-regression sensitivity analysis to see the degree to which those estimation methods covaried with outcomes.
• For outcomes with more than 15 studies, we originally intended to deal with unit-of-analysis issues using multilevel meta-analysis methods described in Viechtbauer (2010) and the methods in Konstantopoulos (2011) or the "pre-defined hierarchy of outcomes" approach. However, because of the strict assumptions of multilevel approaches, we adopted cluster-robust methods (Tanner-Smith, 2014; Tanner-Smith, 2016; Tipton, 2015) that have become more accessible since the protocol had been published.
• Similarly, we adopted cluster-robust models instead of the proposed simple, random effects models for outcomes with 15 or fewer studies. The exception is the social studies outcome, for which there was just one study with multiple effect sizes, so a random-effects model was used.
• We conducted a sensitivity analysis as suggested in Tanner-Smith (2016) and Tipton (2015) to see how different assumptions about the correlation between within-study effect sizes (ρ) might affect cluster-robust results.
• Because leave-one-out sensitivity analyses have become easy to implement since the publication of the protocol, we conducted a leave-one-out sensitivity analysis.
• We used GOSH plots instead of forest plots when there were too many effect sizes to create a legible forest plot in R.
• We visually examined Baujat plots, radial plots, and various types of residual and fit plots to examine study heterogeneity.
• An examination of funnel plots revealed an unexpectedly high degree of asymmetry with the bias in favor of Montessori education (i.e., studies were missing on the left/traditional education side of the funnel plots) on some outcomes. To make better decisions about the quality of evidence, we took the position of Duval (2000) that a trim-and-fill analysis would help us quantify the degree of publication bias while steering away from binary null hypothesis statistical significance tests of publication bias, as suggested by Higgins (2021). Although we did not intend to conduct a sophisticated analysis for publication bias, we think that the ability to quantify the effects of the potential publication bias on effect sizes warrants this deviation from the protocol.
• The protocol was somewhat unclear in the moderator/subgroup analyses that would be conducted, so we clarified them here.
• In the protocol, we proposed examining academic and behavioral outcomes. After using an emergent approach to identify specific outcomes, we determined that the term nonacademic outcomes is a more accurate descriptor than behavioral outcomes. We found that the variety of nonacademic outcomes measured in the Montessori literature was so broad, and not limited to just behavioral outcomes, that the best descriptor is simply nonacademic.
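The Benjamini-Hochberg correction (Benjamini, 1995) mentioned in the list above is a step-up procedure: p values are ranked, each is compared with a threshold that grows with its rank, and every hypothesis up to the largest rank that passes is rejected. The sketch below is a generic illustration with hypothetical p values; the function name and the false discovery rate of 0.05 are assumptions and are not taken from the review's online supplement.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) whose p value satisfies p_(k) <= (k / m) * q
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            max_rank = rank

    # Reject every hypothesis ranked at or below that largest passing rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected

# Hypothetical p values for eight outcomes
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(example_p))
```

Note that a p value can be rejected even if it misses its own rank threshold, as long as a larger p value passes a later one; that is what distinguishes the step-up rule from comparing each p value with its threshold in isolation.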